Principal Curves
Author(s): Trevor Hastie and Werner Stuetzle
Source: Journal of the American Statistical Association, Vol. 84, No. 406 (Jun., 1989), pp. 502-516
Published by: American Statistical Association
Stable URL: http://www.jstor.org/stable/2289936

Principal Curves
TREVOR HASTIE and WERNER STUETZLE*

         Principal curves are smooth one-dimensional curves that pass through the middle of a p-dimensional data set, providing a
         nonlinear summary of the data. They are nonparametric, and their shape is suggested by the data. The algorithm for constructing
         principal curves starts with some prior summary, such as the usual principal-component line. The curve in each successive
         iteration is a smooth or local average of the p-dimensional points, where the definition of local is based on the distance in arc
         length of the projections of the points onto the curve found in the previous iteration. In this article principal curves are defined,
         an algorithm for their construction is given, some theoretical results are presented, and the procedure is compared to other
         generalizations of principal components. Two applications illustrate the use of principal curves. The first describes how the
         principal-curve procedure was used to align the magnets of the Stanford linear collider. The collider uses about 950 magnets
         in a roughly circular arrangement to bend electron and positron beams and bring them to collision. After construction, it was
         found that some of the magnets had ended up significantly out of place. As a result, the beams had to be bent too sharply and
         could not be focused. The engineers realized that the magnets did not have to be moved to their originally planned locations,
         but rather to a sufficiently smooth arc through the middle of the existing positions. This arc was found using the principal-
         curve procedure. In the second application, two different assays for gold content in several samples of computer-chip waste
         appear to show some systematic differences that are blurred by measurement error. The classical approach using linear errors
         in variables regression can detect systematic linear differences but is not able to account for nonlinearities. When the first linear
         principal component is replaced with a principal curve, a local "bump" is revealed, and bootstrapping is used to verify its
         presence.
         KEY WORDS: Errors in variables; Principal components; Self-consistency; Smoother; Symmetric.



1. INTRODUCTION

Consider a data set consisting of n observations on two variables, x and y. We can represent the n points in a scatterplot, as in Figure 1a. It is natural to try and summarize the pattern exhibited by the points in the scatterplot. The type of summary we choose depends on the goal of our analysis; a trivial summary is the mean vector that simply locates the center of the cloud but conveys no information about the joint behavior of the two variables.

It is often sensible to treat one of the variables as a response variable and the other as an explanatory variable. Hence the aim of the analysis is to seek a rule for predicting the response using the value of the explanatory variable. Standard linear regression produces a linear prediction rule. The expectation of y is modeled as a linear function of x and is usually estimated by least squares. This procedure is equivalent to finding the line that minimizes the sum of vertical squared deviations (as depicted in Fig. 1a).

In many situations we do not have a preferred variable that we wish to label "response," but would still like to summarize the joint behavior of x and y. The dashed line in Figure 1a shows what happens if we use x as the response. So, simply assigning the role of response to one of the variables could lead to a poor summary. An obvious alternative is to summarize the data by a straight line that treats the two variables symmetrically. The first principal-component line in Figure 1b does just this: it is found by minimizing the orthogonal deviations.

Linear regression has been generalized to include nonlinear functions of x. This has been achieved using predefined parametric functions, and with the reduced cost and increased speed of computing, nonparametric scatterplot smoothers have gained popularity. These include kernel smoothers (Watson 1964), nearest-neighbor smoothers (Cleveland 1979), and spline smoothers (Silverman 1985). In general, scatterplot smoothers produce a curve that attempts to minimize the vertical deviations (as depicted in Fig. 1c), subject to some form of smoothness constraint. The nonparametric versions referred to before allow the data to dictate the form of the nonlinear dependency.

We consider similar generalizations for the symmetric situation. Instead of summarizing the data with a straight line, we use a smooth curve; in finding the curve we treat the two variables symmetrically. Such curves pass through the middle of the data in a smooth way, whether or not the middle of the data is a straight line. This situation is depicted in Figure 1d. These curves, like linear principal components, focus on the orthogonal or shortest distance to the points. We formally define principal curves to be those smooth curves that are self-consistent for a distribution or data set. This means that if we pick any point on the curve, collect all of the data that project onto this point, and average them, then this average coincides with the point on the curve.

The algorithm for finding principal curves is equally intuitive. Starting with any smooth curve (usually the largest principal component), it checks if this curve is self-consistent by projecting and averaging. If it is not, the procedure is repeated, using the new curve obtained by averaging as a starting guess. This is iterated until (hopefully) convergence.

* Trevor Hastie is Member of Technical Staff, AT&T Bell Laboratories, Murray Hill, NJ 07974. Werner Stuetzle is Associate Professor, Department of Statistics, University of Washington, Seattle, WA 98195. This work was developed for the most part at Stanford University, with partial support from U.S. Department of Energy Contracts DE-AC03-76SF and DE-AT03-81-ER10843, U.S. Office of Naval Research Contract N00014-81-K-0340, and U.S. Army Research Office Contract DAAG29-82-K-0056. The authors thank Andreas Buja, Tom Duchamp, Iain Johnstone, and Larry Shepp for their theoretical support; Robert Tibshirani, Brad Efron, and Jerry Friedman for many helpful discussions and suggestions; Horst Friedsam and Will Oren for supplying the Stanford linear collider example and their help with the analysis; and both referees for their constructive criticism of earlier drafts.

© 1989 American Statistical Association
Journal of the American Statistical Association
June 1989, Vol. 84, No. 406, Theory and Methods




Figure 1. (a) The linear regression line minimizes the sum of squared deviations in the response variable. (b) The principal-component line minimizes the sum of squared deviations in all of the variables. (c) The smooth regression curve minimizes the sum of squared deviations in the response variable, subject to smoothness constraints. (d) The principal curve minimizes the sum of squared deviations in all of the variables, subject to smoothness constraints.

The largest principal-component line plays roles other than that of a data summary:

1. In errors-in-variables regression it is assumed that there is randomness in the predictors as well as the response. This can occur in practice when the predictors are measurements of some underlying variables and there is error in the measurements. It also occurs in observational studies where neither variable is fixed by design. The errors-in-variables regression technique models the expectation of y as a linear function of the systematic component of x. In the case of a single predictor, the model is estimated by the principal-component line. This is also the total least squares method of Golub and van Loan (1979). More details are given in an example in Section 8.

2. Often we want to replace several highly correlated variables with a single variable, such as a normalized linear combination of the original set. The first principal component is the normalized linear combination with the largest variance.

3. In factor analysis we model the systematic component of the data by linear functions of a small set of unobservable variables called factors. Often the models are estimated using linear principal components; in the case of one factor [Eq. (1), as follows] one could use the largest principal component. Many variations of this model have appeared in the literature.

In all the previous situations the model can be written as

x_i = u₀ + aλ_i + e_i,   (1)

where u₀ + aλ_i is the systematic component and e_i is the random component. If we assume that cov(e_i) = σ²I, then the least squares estimate of a is the first linear principal component.

A natural generalization of (1) is the nonlinear model

x_i = f(λ_i) + e_i.   (2)

This might then be a factor analysis or structural model, and, for two variables and some restrictions, an errors-in-variables regression model. In the same spirit as before, where we used the first linear principal component to estimate (1), the techniques described in this article can be used to estimate the systematic component in (2).
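To make the connection concrete, the following Python sketch (an illustration only, not the authors' code; the data and variable names are assumed) simulates data from model (1) with u₀ = 0 and recovers the direction a as the first linear principal component:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 500, 3
    a = np.array([3.0, 2.0, 1.0])
    a /= np.linalg.norm(a)                   # unit-norm direction
    lam = rng.normal(size=n)                 # latent parameters lambda_i
    e = rng.normal(scale=0.2, size=(n, p))   # errors with cov(e) = sigma^2 I
    x = lam[:, None] * a + e                 # model (1) with u_0 = 0

    xc = x - x.mean(axis=0)
    a_hat = np.linalg.svd(xc, full_matrices=False)[2][0]  # first principal component
    print(abs(a_hat @ a))                    # near 1: estimated direction matches a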
We focus on the definition of principal curves and an algorithm for finding them. We also present some theoretical results, although many open questions remain.

2. THE PRINCIPAL CURVES OF A PROBABILITY DISTRIBUTION

We first give a brief introduction to one-dimensional curves, and then define the principal curves of smooth probability distributions in p space. Subsequent sections give algorithms for finding the curves, both for distributions and finite realizations. This is analogous to motivating a scatterplot smoother, such as a moving average or kernel smoother, as an estimator for the conditional expectation of the underlying distribution. We also briefly discuss an alternative approach via regularization using smoothing splines.

2.1 One-Dimensional Curves

A one-dimensional curve in p-dimensional space is a vector f(λ) of p functions of a single variable λ. These functions are called the coordinate functions, and λ provides an ordering along the curve. If the coordinate functions are smooth, then f is by definition a smooth curve. We can apply any monotone transformation to λ, and by modifying the coordinate functions appropriately the curve remains unchanged. The parameterization, however, is different. There is a natural parameterization for curves in terms of the arc length. The arc length of a curve f from λ₀ to λ₁ is given by l = ∫_{λ₀}^{λ₁} ||f′(z)|| dz. If ||f′(z)|| ≡ 1, then l = λ₁ − λ₀. This is a desirable situation, since if all of the coordinate variables are in the same units of measurement, then λ is also in those units.

The vector f′(λ) is tangent to the curve at λ and is sometimes called the velocity vector at λ. A curve with ||f′|| ≡ 1 is called a unit-speed parameterized curve. We can always reparameterize any smooth curve with ||f′|| > 0 to make it unit speed. In addition, our intuitive concept of smoothness relates more naturally to unit-speed curves. For a unit-speed curve, smoothness of the coordinate functions translates directly into smooth visual appearance of the point set {f(λ), λ ∈ Λ} (absence of sharp bends). If v is a unit vector, then f(λ) = v₀ + λv is a unit-speed straight line. This parameterization is not unique: f*(λ) = u + av + λv is another unit-speed parameterization for the same line. In the following we always assume that ⟨u, v⟩ = 0.

The vector f″(λ) is called the acceleration of the curve at λ, and for a unit-speed curve it is easy to check that it is orthogonal to the tangent vector. In this case f″/||f″|| is called the principal normal to the curve at λ. The vectors f′(λ) and f″(λ) span a plane. There is a unique unit-speed circle in the plane that goes through f(λ) and has the same velocity and acceleration at f(λ) as the curve itself (see Fig. 2). The radius r_f(λ) of this circle is called the radius of curvature of the curve f at λ; it is easy to see that r_f(λ) = 1/||f″(λ)||. The center c_f(λ) of the circle is called the center of curvature of f at λ. Thorpe (1979) gave a clear introduction to these and related ideas in differential geometry.

Figure 2. The radius of curvature is the radius of the circle tangent to the curve with the same acceleration as the curve.

2.2 Definition of Principal Curves

Denote by X a random vector in R^p with density h and finite second moments. Without loss of generality, assume E(X) = 0. Let f denote a smooth (C∞) unit-speed curve in R^p parameterized over Λ ⊂ R¹, a closed (possibly infinite) interval, that does not intersect itself (λ₁ ≠ λ₂ ⇒ f(λ₁) ≠ f(λ₂)) and has finite length inside any finite ball in R^p.

We define the projection index λ_f: R^p → R¹ as

λ_f(x) = sup {λ: ||x − f(λ)|| = inf_μ ||x − f(μ)||}.   (3)

The projection index λ_f(x) of x is the value of λ for which f(λ) is closest to x. If there are several such values, we pick the largest one. We show in the Appendix that λ_f(x) is well defined and measurable.

Definition 1. The curve f is called self-consistent or a principal curve of h if E(X | λ_f(X) = λ) = f(λ) for a.e. λ.

Figure 3 illustrates the intuitive motivation behind our definition of a principal curve. For any particular parameter value λ we collect all of the observations that have f(λ) as their closest point on the curve. If f(λ) is the average of those observations, and if this holds for all λ, then f is called a principal curve. In the figure we have actually averaged observations projecting into a neighborhood on the curve. This gives the flavor of our data algorithms to come; we need to do some kind of local averaging to estimate conditional expectations.

The definition of principal curves immediately gives rise to several interesting questions: For what kinds of distributions do principal curves exist, how many different principal curves are there for a given distribution, and what are their properties? We are unable to answer those questions in general. We can, however, show that the definition is not vacuous, and that there are densities that do have principal curves.
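When a curve is stored as a polygon, as in the data algorithm of Section 5, the projection index (3) can be computed segment by segment. The sketch below is a minimal illustration under that polygonal assumption; the function name and representation are ours:

    import numpy as np

    def projection_index(x, F, lam):
        # lambda_f(x) for a polygonal curve with vertices F[k] at arc
        # lengths lam[k]; ties broken by the largest lambda, as in (3).
        best_d2, best_lam = np.inf, None
        for k in range(len(F) - 1):
            seg = F[k + 1] - F[k]
            L2 = seg @ seg
            t = 0.0 if L2 == 0 else np.clip((x - F[k]) @ seg / L2, 0.0, 1.0)
            proj = F[k] + t * seg
            d2 = (x - proj) @ (x - proj)
            lam_k = lam[k] + t * (lam[k + 1] - lam[k])
            if d2 < best_d2 or (np.isclose(d2, best_d2) and lam_k > best_lam):
                best_d2, best_lam = d2, lam_k
        return best_lam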
It is easy to check that for ellipsoidal distributions the principal components are principal curves. For a spherically symmetric distribution, any line through the mean vector is a principal curve. For any two-dimensional spherically symmetric distribution, a circle with the center at the origin and radius E||X|| is a principal curve. (Strictly speaking, a circle does not fit our definition, because it does intersect itself. Nevertheless, see our note at the beginning of the Appendix, and Sec. 5.6, for more details.)

We show in the Appendix that for compact Λ it is always possible to construct densities with the carrier in a thin tube around f, which have f as a principal curve.

What about data generated from the model X = f(λ) + ε, with f smooth and E(ε) = 0? Is f a principal curve for this distribution? The answer generally seems to be no. We show in Section 7, in the more restrictive setting of data scattered around the arc of a circle, that the mean of the conditional distribution of x, given λ(x) = λ₀, lies outside the circle of curvature at λ₀; this implies that f cannot be a principal curve. So in this situation the principal curve is biased for the functional model. We have some evidence that this bias is small, and that it decreases to 0 as the variance of the errors gets small relative to the radius of curvature. We discuss this bias, as well as estimation bias (which fortunately appears to operate in the opposite direction), in Section 7.

Figure 3. Each point on a principal curve is the average of the points that project there.

3. CONNECTIONS BETWEEN PRINCIPAL CURVES AND PRINCIPAL COMPONENTS

In this section we establish some facts that make principal curves appear as a reasonable generalization of linear principal components.

Proposition 1. If a straight line f(λ) = u₀ + λv₀ is self-consistent, then it is a principal component.

Proof. The line has to pass through the origin, because

0 = E(X) = E_λ E(X | λ_f(X) = λ) = E_λ (u₀ + λv₀) = u₀ + (E_λ λ) v₀.

Therefore, u₀ = 0 (recall that we assumed u₀ ⊥ v₀). It remains to show that v₀ is an eigenvector of Σ, the covariance of X:

Σv₀ = E(XX′)v₀
    = E_λ E(XX′v₀ | λ_f(X) = λ)
    = E_λ E(XX′v₀ | X′v₀ = λ)
    = E_λ [λ E(X | X′v₀ = λ)]
    = E_λ (λ²) v₀,

so v₀ is an eigenvector of Σ with eigenvalue E_λ(λ²).

Principal components need not be self-consistent in the sense of the definition; however, they are self-consistent with respect to linear regression.

Proposition 2. Suppose that l(λ) = u₀ + λv₀ is a straight line, and that we linearly regress the p components X_j of X on the projection λ_l(X), resulting in linear functions f_j(λ). Then f = l iff v₀ is an eigenvector of Σ and u₀ = 0.

The proof of this requires only elementary linear algebra and is omitted.
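Self-consistency of the principal-component line for an ellipsoidal distribution is easy to check by simulation. The following sketch is ours (the covariance is an arbitrary choice); it bins observations by their projection onto the first principal component and compares the bin averages with the corresponding points on the line:

    import numpy as np

    rng = np.random.default_rng(1)
    cov = np.array([[4.0, 1.5], [1.5, 1.0]])
    X = rng.multivariate_normal([0.0, 0.0], cov, size=20000)

    w, V = np.linalg.eigh(cov)
    v0 = V[:, np.argmax(w)]          # first principal-component direction
    lam = X @ v0                     # projection index onto the line f(l) = l*v0

    # E(X | lambda in a narrow bin) should lie close to f(lambda)
    for lo in (-2.0, 0.0, 1.0):
        sel = (lam >= lo) & (lam < lo + 0.25)
        avg = X[sel].mean(axis=0)
        mid = (lo + 0.125) * v0      # point on the line at the bin midpoint
        print(np.round(avg, 3), np.round(mid, 3))  # approximately equal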
A Distance Property of Principal Curves

An important property of principal components is that they are critical points of the distance from the observations. Let d(x, f) denote the usual euclidean distance from a point x to its projection on f: d(x, f) = ||x − f(λ_f(x))||, and define D²(h, f) = E_h d²(X, f). Consider a straight line l(λ) = u + λv. The distance D²(h, l) in this case may be regarded as a function of u and v: D²(h, l) = D²(h, u, v). It is well known that grad_{u,v} D²(h, u, v) = 0 iff u = 0 and v is an eigenvector of Σ, that is, the line l is a principal-component line.

We now restate this fact in a variational setting and extend it to principal curves. Let G denote a class of curves parameterized over Λ. For g ∈ G define f_t = f + tg. This creates a perturbed version of f.

Definition 2. The curve f is called a critical point of the distance function for variations in the class G iff

dD²(h, f_t)/dt |_{t=0} = 0   ∀g ∈ G.

Proposition 3. Let G_l denote the class of straight lines g(λ) = a + λb. A straight line l₀(λ) = a₀ + λb₀ is a critical point of the distance function for variations in G_l iff b₀ is an eigenvector of cov(X) and a₀ = 0.

The proof involves straightforward linear algebra and is omitted. A result analogous to Proposition 3 holds for principal curves.
Proposition 4. Let G_B denote the class of smooth (C∞) curves g parameterized over Λ, with ||g|| ≤ 1 and ||g′|| ≤ 1. Then f is a principal curve of h iff f is a critical point of the distance function for perturbations g ∈ G_B.

A proof of Proposition 4 is given in the Appendix. The condition that ||g|| is bounded guarantees that f_t lies in a thin tube around f and that the tubes shrink uniformly as t → 0. The boundedness of ||g′|| ensures that for t small enough, f_t′ is well behaved and, in particular, bounded away from 0 for t < 1. Both conditions together guarantee that, for small enough t, λ_{f_t} is well defined.

4. AN ALGORITHM FOR FINDING PRINCIPAL CURVES

By analogy to linear principal-component analysis, we are particularly interested in finding smooth curves corresponding to local minima of the distance function. Our strategy is to start with a smooth curve, such as the largest linear principal component, and check if it is a principal curve. This involves projecting the data onto the curve and then evaluating their expectation conditional on where they project. Either this conditional expectation coincides with the curve, or we get a new curve as a by-product. We then check if the new curve is self-consistent, and so on. If the self-consistency condition is met, we have found a principal curve. It is easy to show that both of the operations of projection and conditional expectation reduce the expected distance from the points to the curve.

The Principal-Curve Algorithm

The previous discussion motivates the following iterative algorithm.

Initialization: Set f^(0)(λ) = x̄ + aλ, where a is the first linear principal component of h. Set λ^(0)(x) = λ_{f^(0)}(x).
Repeat: over iteration counter j
   1. Set f^(j)(·) = E(X | λ_{f^(j−1)}(X) = ·).
   2. Define λ^(j)(x) = λ_{f^(j)}(x) for all x; transform λ^(j) so that f^(j) is unit speed.
   3. Evaluate D²(h, f^(j)) = E_{λ^(j)} E[||X − f^(j)(λ^(j)(X))||² | λ^(j)(X)].
Until: the change in D²(h, f^(j)) is below some threshold.

There are potential problems with this algorithm. Although principal curves are by definition differentiable, there is no guarantee that the curves produced by the conditional-expectation step of the algorithm have this property. Discontinuities can certainly occur at the endpoints of a curve. The problem is illustrated in Figure 4, where the expected values of the observations projecting onto f(λ_min) and f(λ_max) are disjoint from the new curve. If this occurs, we have to join f(λ_min) and f(λ_max) to the rest of the curve in a differentiable fashion. In light of the previous discussion, we cannot prove that the algorithm converges. All we have is some evidence in its favor:

1. By definition, principal curves are fixed points of the algorithm.
2. Assuming that each iteration is well defined and produces a differentiable curve, we can show that the expected distance D²(h, f^(j)) converges.
3. If the conditional-expectation operation in the principal-curve algorithm is replaced by fitting a least squares straight line, then the procedure converges to the largest principal component.

Figure 4. The mean of the observations projecting onto an endpoint of the curve can be disjoint from the rest of the curve.

5. PRINCIPAL CURVES FOR DATA SETS

So far, we have considered principal curves of a multivariate probability distribution. In reality, however, we usually work with finite multivariate data sets. Suppose that X is an n × p matrix of n observations on p variables. We regard the data set as a sample from an underlying probability distribution.

A curve f(λ) is represented by n tuples (λ_i, f_i), joined up in increasing order of λ to form a polygon. Clearly, the geometric shape of the polygon depends only on the order, not on the actual values of the λ_i. We always assume that the tuples are sorted in increasing order of λ, and we use the arc-length parameterization, for which λ₁ = 0 and λ_i is the arc length along the polygon from f₁ to f_i. This is the discrete version of the unit-speed parameterization.

As in the distribution case, the algorithm alternates between a projection step and an expectation step. In the absence of prior information we use the first principal-component line as a starting curve; the f_i are taken to be the projections of the n observations onto the line.

We iterate until the relative change in the distance |D²(h, f^(j−1)) − D²(h, f^(j))| / D²(h, f^(j−1)) is below some threshold (we use .001). The distance is estimated in the obvious way, adding up the squared distances of the points in the sample to their closest points on the current curve. We are unable to prove that the algorithm converges, or that each step guarantees a decrease in the criterion. In practice, we have had no convergence problems with more than 40 real and simulated examples.
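A minimal end-to-end sketch of this data algorithm follows, combining the projection step (Sec. 5.1, below) and the conditional-expectation step (Sec. 5.2, below). It is an illustration, not the authors' S implementation: the helper names are ours, and a simple kernel average stands in for the locally weighted running-lines smoother.

    import numpy as np

    def project_to_polygon(X, F):
        # Project each row of X onto the polygon with ordered vertices F;
        # return arc-length parameters and squared distances.
        seg = np.diff(F, axis=0)
        seglen = np.linalg.norm(seg, axis=1)
        lam_v = np.concatenate([[0.0], np.cumsum(seglen)])  # vertex arc lengths
        lam = np.zeros(len(X))
        d2 = np.full(len(X), np.inf)
        for k in range(len(seg)):
            L2 = seg[k] @ seg[k]
            if L2 == 0:
                continue
            t = np.clip((X - F[k]) @ seg[k] / L2, 0.0, 1.0)
            P = F[k] + t[:, None] * seg[k]
            dk = ((X - P) ** 2).sum(axis=1)
            better = dk < d2
            d2[better] = dk[better]
            lam[better] = lam_v[k] + t[better] * seglen[k]
        return lam, d2

    def smooth_coordinates(lam, X, span):
        # Conditional-expectation step: local average of each coordinate
        # against lam (a crude stand-in for locally weighted running lines).
        h = span * (lam.max() - lam.min()) / 2
        F = np.empty_like(X, dtype=float)
        for i, l0 in enumerate(lam):
            w = np.maximum(1.0 - ((lam - l0) / h) ** 2, 0.0)
            F[i] = (w @ X) / w.sum()
        return F

    def principal_curve(X, span=0.6, tol=1e-3, max_iter=25):
        Xc = X - X.mean(axis=0)
        lam = Xc @ np.linalg.svd(Xc, full_matrices=False)[2][0]  # first PC line
        D2 = np.inf
        for _ in range(max_iter):
            order = np.argsort(lam)
            F = smooth_coordinates(lam[order], X[order], span)  # expectation step
            lam, d2 = project_to_polygon(X, F)                  # projection step
            if np.isfinite(D2) and abs(D2 - d2.mean()) / D2 < tol:
                break
            D2 = d2.mean()
        return F, lam, d2.mean()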
5.1 The Projection Step

For fixed f^(j)(·) we wish to find for each x_i in the sample the value λ_i = λ_{f^(j)}(x_i). Define d_ik as the distance between x_i and its closest point on the line segment joining each pair (f^(j)(λ_k), f^(j)(λ_{k+1})). Corresponding to each d_ik is a value λ_ik ∈ [λ_k, λ_{k+1}]. We then set λ_i to the λ_ik* corresponding to the smallest value of d_ik:

λ_i = λ_ik*   if d_ik* = min_k d_ik.   (4)

Corresponding to each λ_i is an interpolated f^(j)_i; using these values to represent the curve, we replace λ_i by the arc length from f^(j)_1 to f^(j)_i.

5.2 The Conditional-Expectation Step: Scatterplot Smoothing

The goal of this step is to estimate f^(j+1)(λ) = E(X | λ^(j) = λ). We restrict ourselves to estimating this quantity at n values of λ, namely λ₁, . . . , λ_n, found in the projection step. A natural way of estimating E(X | λ^(j) = λ_i) would be to gather all of the observations that project onto f^(j) at λ_i and find their mean. Unfortunately, there is generally only one such observation, x_i. It is at this stage that we introduce the scatterplot smoother, a fundamental building block in the principal-curve procedure for finite data sets. We estimate the conditional expectation at λ_i by averaging all of the observations x_k in the sample for which λ_k is close to λ_i. As long as these observations are close enough and the underlying conditional expectation is smooth, the bias introduced in approximating the conditional expectation is small. On the other hand, the variance of the estimate decreases as we include more observations in the neighborhood.

Scatterplot Smoothing. Local averaging is not a new idea. In the more common regression context, scatterplot smoothers are used to estimate the regression function E(Y | x) by local averaging. Some commonly used smoothers are kernel smoothers (e.g., Watson 1964), spline smoothers (Silverman 1985; Wahba and Wold 1975), and the locally weighted running-line smoother of Cleveland (1979). All of these smooth a one-dimensional response against a covariate. In our case, the variable to be smoothed is p-dimensional, so we simply smooth each coordinate separately. Our current implementation of the algorithm is an S function (Becker, Chambers, and Wilks 1988) that allows any scatterplot smoother to be used. We have experience with all of those previously mentioned, although all of the examples were fitted using locally weighted running lines. We give a brief description; for details see Cleveland (1979).

Locally Weighted Running-Lines Smoother. Consider the estimation of E(x | λ), that is, a single coordinate function, based on a sample of pairs (λ₁, x₁), . . . , (λ_n, x_n), and assume the λ_i are ordered. To estimate E(x | λ_i), the smoother fits a straight line to the wn observations {x_j} closest in λ to λ_i. The estimate is taken to be the fitted value of the line at λ_i. The fraction w of points in the neighborhood is called the span. In fitting the line, weighted least squares regression is used. The weights are derived from a symmetric kernel centered at λ_i that dies smoothly to 0 within the neighborhood. Specifically, if Δ_i is the distance to the wnth nearest neighbor, then the points x_j in the neighborhood get weights w_j = [1 − |(λ_j − λ_i)/Δ_i|³]³.
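A sketch of this smoother for a single coordinate follows. It is our illustration of Cleveland's description rather than his implementation; the tricube form of the kernel is taken from Cleveland (1979):

    import numpy as np

    def lowess_line(lam, x, span=0.5):
        # Locally weighted running-lines smoother for one coordinate: fit a
        # weighted least squares line to the wn nearest points in lambda and
        # evaluate it at lam[i]; the weights are the tricube kernel.
        n = len(lam)
        k = max(2, int(np.ceil(span * n)))
        fit = np.empty(n)
        for i in range(n):
            d = np.abs(lam - lam[i])
            idx = np.argsort(d)[:k]          # wn nearest neighbors in lambda
            delta = d[idx].max()             # distance to the wn-th neighbor
            if delta == 0:                   # all neighbors tied with lam[i]
                fit[i] = x[idx].mean()
                continue
            w = (1.0 - (d[idx] / delta) ** 3) ** 3   # tricube weights
            sw = np.sqrt(w)
            A = np.column_stack([np.ones(k), lam[idx]])
            beta, *_ = np.linalg.lstsq(A * sw[:, None], x[idx] * sw, rcond=None)
            fit[i] = beta[0] + beta[1] * lam[i]
        return fit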
5.3 A Demonstration of the Algorithm

To illustrate the principal-curve procedure, we generated a set of 100 data points from a circle in two dimensions with independent Gaussian errors in both coordinates:

(x₁, x₂)′ = (5 sin λ, 5 cos λ)′ + (e₁, e₂)′,   (5)

where λ is uniformly distributed on [0, 2π) and e₁ and e₂ are independent N(0, 1).

Figure 5 shows the data, the circle (dashed line), and the estimated curve (solid line) for selected steps of the iteration. The starting curve is the first principal component (Fig. 5a). Any line through the origin is a principal curve for the population model (5), but this is not generally the case for data. Here the algorithm converges to an estimate for another population principal curve, the circle. This example is admittedly artificial, but it presents the principal-curve procedure with a particularly tough job. The starting guess is wholly inappropriate and the projection of the points onto this line does not nearly represent the final ordering of the points when projected onto the solution curve. Points project in a certain order on the starting vector (as depicted in Fig. 6). The new curve is a function of λ^(0) obtained by averaging the coordinates of points close in λ^(0). The new λ^(1) values are found by projecting the points onto the new curve. It can be seen that the ordering of the projected points along the new curve can be very different from the ordering along the previous curve. This enables the successive curves to bend to shapes that could not be parameterized as a function of the linear principal component.

Figure 5. Selected Iterates of the Principal-Curve Procedure for the Circle Data. In all of the figures we see the data, the circle from which the data are generated, and the current estimate produced by the algorithm: (a) the starting curve is the principal-component line, with average squared distance D²(f^(0)) = 12.91; (b) iteration 2: D²(f^(2)) = 10.43; (c) iteration 4: D²(f^(4)) = 2.58; (d) final iteration 8: D²(f^(8)) = 1.55.
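Data following (5) are easy to generate; the snippet below (which assumes the hypothetical principal_curve sketch given before Section 5.1) reproduces the setup of this demonstration:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 100
    lam = rng.uniform(0.0, 2 * np.pi, size=n)
    X = np.column_stack([5 * np.sin(lam), 5 * np.cos(lam)])
    X = X + rng.normal(size=(n, 2))   # independent N(0, 1) errors, as in (5)

    F, lam_hat, D2 = principal_curve(X, span=0.6)   # hypothetical helper from above
    print(D2)   # average squared distance from the points to the fitted curve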
Figure 6. Schematics Emphasizing the Iterative Nature of the Algorithm. The curve of the first iteration is a function of λ^(0) measured along the starting vector (a). The curve of the second iteration is a function of λ^(1) measured along the curve of the first iteration (b).

5.4 Span Selection for the Scatterplot Smoother

The crucial parameter of any local averaging smoother is the size of the neighborhood over which averaging takes place. We discuss the choice of the span w for the locally weighted running-line smoother.

A Fixed-Span Strategy. The common first guess for f is a straight line. In many interesting situations, the final curve is not a function of the arc length of this initial curve (see Fig. 6). It is reached by successively bending the original curve. We have found that if the initial span of the smoother is too small, the curve may bend too fast, and follow the data too closely. Our most successful strategy has been to initially use a large span, and then to decrease it gradually. In particular, we start with a span of .6n observations in each neighborhood, and let the algorithm converge (according to the criterion outlined previously). We then drop the span to .5n and iterate till convergence. Finally, the same is done at .4n, by which time the procedure has found the general shape of the curve. The curves in Figure 5 were found using this strategy.

Spans of this magnitude have frequently been found appropriate for scatterplot smoothing in the regression context. In some applications, especially the two-dimensional ones, we can plot the curve and the points and select a span that seems appropriate for the data. Other applications, such as the collider-ring example in Section 8, have a natural criterion for selecting the span.

Automatic Span Selection by Cross-Validation. Assume the procedure has converged to a self-consistent (with respect to the smoother) curve for the span last used. We do not want the fitted curve to be too wiggly relative to the density of the data. As we reduce the span, the average distance decreases and the curve follows the data more closely. The human eye is skilled at making trade-offs between smoothness and fidelity to the data; we would like a procedure that makes this judgment automatically.

A similar situation arises in nonparametric regression, where we have a response y and a covariate x. One rationale for making the smoothness judgment automatically is to ensure that the fitted function of x does a good job in predicting future responses. Cross-validation (Stone 1974) is an approximate method for achieving this goal, and proceeds as follows. We predict each response y_i in the sample using a smooth estimated from the sample with the ith observation omitted; let ŷ_(i) be this predicted value, and define the cross-validated residual sum of squares as CVRSS = Σ_i (y_i − ŷ_(i))². CVRSS/n is an approximately unbiased estimate of the expected squared prediction error. If the span is too large, the curve will miss features in the data, and the bias component of the prediction error will dominate. If the span is too small, the curve begins to fit the noise in the data, and the variance component of the prediction error will increase. We pick the span that corresponds to the minimum CVRSS.
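In outline, cross-validatory span selection for one coordinate might look as follows. This is a sketch reusing the hypothetical lowess_line smoother above; a practical implementation would use the smoother's built-in leave-one-out option rather than n separate refits:

    import numpy as np

    def cvrss(lam, x, span, smoother):
        # Leave-one-out cross-validated residual sum of squares for one
        # coordinate; smoother(lam, x, span) returns fitted values.
        n = len(lam)
        rss = 0.0
        for i in range(n):
            keep = np.arange(n) != i
            fit = smoother(lam[keep], x[keep], span)
            # predict the omitted x_i by interpolating the fitted curve
            order = np.argsort(lam[keep])
            pred = np.interp(lam[i], lam[keep][order], fit[order])
            rss += (x[i] - pred) ** 2
        return rss

    # choose the span minimizing CVRSS, e.g.
    # best = min([.3, .4, .5, .6], key=lambda s: cvrss(lam, x, s, lowess_line))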
In the principal-curve algorithm, we can use the same procedure for estimating the spans for each coordinate function separately, as a final smoothing step. Since most smoothers have this feature built in as an option, cross-validation in this manner is trivial to implement. Figure 7a shows the final curve after one more smoothing step, using cross-validation to select the span; nothing much has changed.

On the other hand, Figure 7b shows what happens if we continue iterating with the cross-validated smoothers. The spans get successively smaller, until the curve almost interpolates the data. In some situations, such as the Stanford linear collider example in Section 8, this may be exactly what we want. It is unlikely, however, that in this event cross-validation would be used to pick the span. A possible explanation for this behavior is that the errors in the coordinate functions are autocorrelated; cross-validation in this situation tends to pick spans that are too small (Hart and Wehrly 1986).

Figure 7. (a) The Final Curve in Figure 6 With One More Smoothing Step, Using Cross-Validation Separately for Each of the Coordinates (D² = 1.28). (b) The Curve Obtained by Continuing the Iterations (D² = .12), Using Cross-Validation at Every Step.
5.5 Principal Curves and Splines

Our algorithm for estimating principal curves from samples is motivated by the algorithm for finding principal curves of densities, which in turn is motivated by the definition of principal curves. This is analogous to the motivation for kernel smoothers and locally weighted running-line smoothers. They estimate a conditional expectation, a population quantity that minimizes a population criterion. They do not minimize a data-dependent criterion.

On the other hand, smoothing splines do minimize data-dependent criteria. The cubic smoothing spline for a set of n pairs (λ₁, x₁), . . . , (λ_n, x_n) and penalty (smoothing parameter) μ minimizes

D²_μ(f) = Σ_{i=1}^n (x_i − f(λ_i))² + μ ∫ (f″(λ))² dλ,   (6)

among all functions f with f′ absolutely continuous and f″ ∈ L² (e.g., see Silverman 1985). We suggest the following criterion for defining principal curves in this context: Find f(λ) and λ_i ∈ [0, 1] (i = 1, . . . , n) so that

D²_μ(f, λ) = Σ_{i=1}^n ||x_i − f(λ_i)||² + μ ∫₀¹ ||f″(λ)||² dλ   (7)

is minimized over all f with coordinate functions f_j ∈ S²[0, 1]. Notice that we have confined the functions to the unit interval and thus do not use the unit-speed parameterization. Intuitively, for a fixed smoothing parameter, functions defined over an arbitrarily large interval can satisfy the second-derivative smoothness criterion and visit every point. It is easy to make this argument rigorous.

We now apply our alternating algorithm to these criteria:

1. Given f, minimizing D²_μ(f, λ) over λ only involves the first part of (7) and is our usual projection step. The λ_i are rescaled to lie in [0, 1].
2. Given λ, (7) splits up into p expressions of the form (6), one for each coordinate function. These are optimized by smoothing the p coordinates against λ, using a cubic spline smoother with parameter μ (see the sketch at the end of this subsection).

The usual penalized least squares arguments show that if a minimum exists, it must be a cubic spline in each coordinate. We make no claims about its existence, or about global convergence properties of this algorithm.

An advantage of the spline-smoothing algorithm is that it can be computed in O(n) operations, and thus is a strong competitor for the kernel-type smoothers that take O(n²) unless approximations are used. Although it is difficult to guess the smoothing parameter μ, alternative methods such as using the approximate degrees of freedom (see Cleveland 1979) are available for assessing the amount of smoothing and thus selecting the parameter.

Our current implementation of the algorithm allows a choice of smoothing splines or locally weighted running lines, and we have found it difficult to distinguish their performance in practice.
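Step 2 of the list above can be carried out with any cubic smoothing-spline routine, applied coordinatewise. The sketch below uses SciPy's make_smoothing_spline, an available smoothing-spline implementation with its own penalty convention, so its parameter only roughly corresponds to μ in (7); the function name and handling of ties are our own:

    import numpy as np
    from scipy.interpolate import make_smoothing_spline

    def spline_step(lam, X, mu):
        # Step 2: given lam, smooth each coordinate against lam with a cubic
        # smoothing spline (criterion (6) per coordinate). Tied lam values
        # are averaged first, since the routine needs strictly increasing
        # abscissae.
        order = np.argsort(lam)
        lam_s, X_s = lam[order], X[order]
        lam_u, inv = np.unique(lam_s, return_inverse=True)
        X_u = np.vstack([X_s[inv == k].mean(axis=0) for k in range(len(lam_u))])
        splines = [make_smoothing_spline(lam_u, X_u[:, j], lam=mu)
                   for j in range(X.shape[1])]
        return lam_u, np.column_stack([s(lam_u) for s in splines])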
5.6 Further Illustrations and Discussion of the Algorithm

The procedure worked well on the circle example and several other artificial examples. Nevertheless, sometimes its behavior is surprising, at least at first glance. Consider a data set from a spherically symmetric unimodal distribution centered at the origin. A circle with radius E||X|| is a principal curve, as are all straight lines passing through the origin. The circle, however, has smaller expected squared distance from the observations than the lines.

The 150 points in Figure 8 were sampled independently from a bivariate spherical Gaussian distribution. When the principal-curve procedure is started from the circle, it does not move much, except at the endpoints (as depicted in Fig. 8a). This is a consequence of the smoother's endpoint behavior, in that it is not constrained to be periodic. Figure 8b shows what happens when we use a periodic version of the smoother, and also start at a circle. Nevertheless, starting from the linear principal component (where theoretically it should stay), and using the nonperiodic smoother, the algorithm iterates to a curve that, apart from the endpoints, appears to be attempting to model the circle. (See Fig. 8c; this behavior occurred repeatedly over several simulations of this example. The ends of the curve are stuck and further iterations do not free them.)

The example illustrates the fact that the algorithm tends to find curves that are minima of the distance function. This is not surprising; after all, the principal-curve algorithm is a generalization of the power method for finding eigenvectors, which exhibits exactly the same behavior. The power method tends to converge to an eigenvector for the largest eigenvalue, unless special precautions are taken.

Interestingly, the algorithm using the periodic smoother and starting from the linear principal component finds a circle identical to that in Figure 8b.
Figure 8. Some Curves Produced by the Algorithm Applied to Bivariate Spherical Gaussian Data: (a) The Curve Found When the Algorithm Is Started at a Circle Centered at the Mean; (b) The Circle Found Starting With Either a Circle or a Line but Using a Periodic Smoother; (c) The Curve Found Using the Regular Smoother, but Starting at a Line. A periodic smoother ensures that the curve found is closed.

7. BIAS CONSIDERATIONS: MODEL AND ESTIMATION BIAS

Model bias occurs when the data are of the form x = f(λ) + e and we wish to recover f(λ). In general, if f(λ) has curvature, it is not a principal curve for the distribution it generates. As a consequence, the principal-curve procedure can only find a biased version of f(λ), even if it starts at the generating curve. This bias goes to 0 with the ratio of the noise variance to the radius of curvature.

Estimation bias occurs because we use scatterplot smoothers to estimate conditional expectations. The bias is introduced by averaging over neighborhoods, which usually has a flattening effect. We demonstrate this bias with a simple example.

A Simple Model for Investigating Bias

Suppose that the curve f is an arc of a circle centered at the origin and with radius ρ, and the data x are generated from a bivariate Gaussian, with mean chosen uniformly on the arc and variance σ²I. Figure 9 depicts the situation. Intuitively, it seems that more mass is put outside the circle than inside, so the circle closest to the data should have radius larger than ρ. Consider the points that project onto a small arc A_θ(λ) of the circle with angle θ centered at λ, as depicted in the figure. As we shrink this arc down to a point, the segment shrinks down to the normal to the curve at that point, but there is always more mass outside the circle than inside. This implies that the conditional expectation lies outside the circle.

Figure 9. The data are generated from the arc of a circle with radius ρ and with iid N(0, σ²I) errors. The location on the circle is selected uniformly. The best fitting circle (dashed) has radius larger than the generating curve.

We can prove (Hastie 1984) that E(x | λ_f(x) ∈ A_θ(λ)) = (r_θ/ρ) f(λ), where

r_θ = r* sin(θ/2)/(θ/2)   (8)

and

r* = E{[(ρ + e₁)² + e₂²]^{1/2}} ≈ ρ + σ²/(2ρ).

Finally, r* → ρ as σ/ρ → 0.

Equation (8) nicely separates the two components of bias. Even if we had infinitely many observations and thus would not need local averaging to estimate conditional expectation, the circle with radius ρ would not be a stationary point of the algorithm; the principal curve is a circle with radius r* > ρ. The factor sin(θ/2)/(θ/2) is attributable to local averaging. There is clearly an optimal span at which the two bias components cancel exactly. In practice, this is not much help, since knowledge of the radius of curvature and the error variance is needed to determine it. Typically, these quantities will change as we move along the curve. Hastie (1984) gives a demonstration that these bias patterns persist in a situation where the curvature changes along the curve.
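The approximation r* ≈ ρ + σ²/(2ρ) can be checked directly by simulation; the following sketch (ours, with ρ and σ as arbitrary choices) estimates r* = E{[(ρ + e₁)² + e₂²]^{1/2}} by Monte Carlo:

    import numpy as np

    rng = np.random.default_rng(3)
    rho, sigma = 10.0, 1.0
    e = rng.normal(scale=sigma, size=(500000, 2))
    r_star = np.sqrt((rho + e[:, 0]) ** 2 + e[:, 1] ** 2).mean()
    print(r_star, rho + sigma ** 2 / (2 * rho))   # both approximately 10.05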
8. EXAMPLES

This section contains two examples that illustrate the use of the procedure.

8.1 The Stanford Linear Collider Project

This application of principal curves was implemented by a group of geodetic engineers at the Stanford Linear Accelerator Center (SLAC) in California. They used the
software developed by the authors in consultation with the first author and Jerome Friedman of SLAC.

The Stanford linear collider (SLC) collides two intense and finely focused particle beams. Details of the collision are recorded in a collision chamber and studied by particle physicists, whose major goal is to discover new subatomic particles. Since there is only one linear accelerator at SLAC, it is used to accelerate a positron and an electron bunch in a single pulse, and the collider arcs bend these beams to bring them to collision (see Fig. 10).

Each of the two collider arcs contains roughly 475 magnets (23 segments of 20, plus some extras), which guide the positron and electron beam. Ideally, these magnets lie on a smooth curve with a circumference of about 3 kilometers (km) (as depicted in the schematic). The collider has a third dimension, and actually resembles a floppy tennis racket, because the tunnel containing the magnets goes underground (whereas the accelerator is aboveground).

Measurement errors were inevitable in the procedure used to place the magnets. This resulted in the magnets lying close to the planned curve, but with errors in the range of ±1.0 millimeters (mm). A consequence of these errors was that the beam could not be adequately focused.

The engineers realized that it was not necessary to move the magnets to the ideal curve, but rather to a curve through the existing magnet positions that was smooth enough to allow focused bending of the beam. This strategy would theoretically reduce the amount of magnet movement necessary. The principal-curve procedure was used to find this curve. The remainder of this section describes some special features of this simple but important application.

Initial attempts at fitting curves used the data in the measured three-dimensional geodetic coordinates, but it was found that the magnet displacements were small relative to the bias induced by smoothing. The theoretical arc was then removed, and subsequent curve fitting was based on the residuals. This was achieved by replacing the three coordinates of each magnet with three new coordinates: (a) the arc length from the beginning of the arc till the point of projection onto the ideal curve (x), (b) the distance from the magnet to this projection in the horizontal plane (y), and (c) the distance in the vertical plane (z).
   The engineers realized that it was not necessary to move       usually available in other applications.
                                                                     There is a natural way of choosing the smoothing pa-
                                                                  rameter in this application/The fitted curve, once trans-
                      collision chamber                           formed back to the original coordinates, can be rep-
                                                                  resented by a polygon with a vertex at each magnet.
                                                                  The angle between these segments is of vital importance,
                                                                  since the further it is from 180°, the harder it is to launch
                                                                  the particle beams into the next segment without hitting
                                                                  the wall of the beam pipe [diameter 1 centimeter (cm)].
                                                                  In fact, if 6i measures the departure of this angle from
                                                                  180°, the operating characteristics of the magnet specify a
                                                                  threshold 0 m a x of .1 milleradian. Now, no smoothing results
                                                                  in no magnet movement (no work), but with many mag-
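The following is a minimal sketch of this coordinate change (our illustration, not the SLAC software). It assumes the ideal curve is available as a densely sampled polygon and that the arc lies roughly in the horizontal plane, so that a horizontal/vertical frame can be built from the tangent:

```python
import numpy as np

# Replace each magnet position by (s, y, z): arc length of its projection
# onto the ideal curve, horizontal offset, and vertical offset.
def residual_coordinates(magnets, ideal):
    """magnets: (n, 3) measured positions; ideal: (m, 3) polygon of the ideal arc."""
    seg = np.diff(ideal, axis=0)
    seg_len = np.linalg.norm(seg, axis=1)
    cum_len = np.concatenate([[0.0], np.cumsum(seg_len)])
    out = []
    for p in magnets:
        # project p onto every segment and keep the closest projection
        t = np.clip(np.einsum('ij,ij->i', p - ideal[:-1], seg) / seg_len**2, 0, 1)
        proj = ideal[:-1] + t[:, None] * seg
        i = np.argmin(np.linalg.norm(p - proj, axis=1))
        s = cum_len[i] + t[i] * seg_len[i]          # arc length, the (x) of the text
        tang = seg[i] / seg_len[i]
        horiz = np.cross([0.0, 0.0, 1.0], tang)     # horizontal normal
        horiz /= np.linalg.norm(horiz)
        vert = np.cross(tang, horiz)                # vertical normal
        r = p - proj[i]
        out.append([s, r @ horiz, r @ vert])        # the (y) and (z) offsets
    return np.asarray(out)
```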
There is a natural way of choosing the smoothing parameter in this application. The fitted curve, once transformed back to the original coordinates, can be represented by a polygon with a vertex at each magnet. The angle between these segments is of vital importance, since the further it is from 180°, the harder it is to launch the particle beams into the next segment without hitting the wall of the beam pipe [diameter 1 centimeter (cm)]. In fact, if θᵢ measures the departure of this angle from 180°, the operating characteristics of the magnet specify a threshold θ_max of .1 milliradian. Now, no smoothing results in no magnet movement (no work), but with many magnets violating the threshold. As the amount of smoothing (span) is increased, the angles tend to decrease, and the residuals and thus the amounts of magnet movement increase. The strategy was to increase the span until no magnets violated the angle constraint.
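The following sketch mirrors that strategy under stated assumptions. The curve fitter (`fit_principal_curve`) is a hypothetical stand-in for the procedure of this article; the angle check operates on the fitted polygon:

```python
import numpy as np

THETA_MAX = 1e-4   # 0.1 milliradian, from the magnet specification

def max_angle_deviation(vertices):
    """Largest deviation (radians) of the fitted polygon's angles from 180 degrees."""
    v = np.diff(vertices, axis=0)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    cos = np.clip(np.einsum('ij,ij->i', v[:-1], v[1:]), -1.0, 1.0)
    return np.arccos(cos).max()

def choose_span(points, candidate_spans, fit_principal_curve):
    for span in sorted(candidate_spans):             # least smoothing first
        curve = fit_principal_curve(points, span)
        if max_angle_deviation(curve) <= THETA_MAX:
            return span, curve                       # first span meeting the spec
    raise ValueError("no candidate span satisfies the angle constraint")
```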
Figure 11 gives the fitted vertical and horizontal components of the chosen curve, for a section of the north arc consisting of 149 magnets. This relatively rough curve was then translated back to the original coordinates, and the appropriate adjustments for each magnet were determined. The systematic trend in these coordinate functions represents systematic departures of the magnets from the theoretical curve. Only 66% of the magnets needed to be moved, since the remaining 34% of the residuals were below 60 μm in length and thus considered negligible.

There are some natural constraints on the system. Some of the magnets were fixed by design and thus could not be moved. The beam enters the arc parallel to the accelerator, so the initial magnets do no bending. Similarly, there are junction points at which no bending is allowed. These constraints are accommodated by attaching weights to the points representing the magnets and using a weighted version of the smoother in the algorithm. By giving the fixed magnets sufficiently large weights, the constraints are met. Figure 11 has the parallel constraints built in at the endpoints.
Figure 11. The Fitted Coordinate Functions (plotted against arc length in meters) for the Magnet Positions for a Section of the Stanford Linear Collider. The data represent residuals from the theoretical curve. Some (35%) of the deviations from the fitted curve were small enough that these magnets were not moved.

Finally, since some of the magnets were way off target, we used a resistant version of the fitting procedure. Points are weighted according to their distance from the fitted curve, and deviations beyond a fixed threshold are given weight 0.
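A sketch of such a weight function (illustrative; the text does not give the exact form used at SLAC):

```python
import numpy as np

# Weights decrease with distance from the current fitted curve and drop to 0
# beyond a fixed threshold; magnets fixed by design get a very large weight,
# which makes the weighted smoother effectively pin the curve to them.
def resistant_weights(dist, threshold, fixed_idx=(), big=1e6):
    w = np.clip(1.0 - (np.asarray(dist, dtype=float) / threshold) ** 2, 0.0, None)
    w[list(fixed_idx)] = big
    return w
```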
7.2 Gold Assay Pairs

A California-based company collects computer-chip waste to sell it for its content of gold and other precious metals. Before bidding for a particular cargo, the company takes a sample to estimate the gold content of the whole lot. The sample is split in two. One subsample is assayed by an outside laboratory, and the other by their own in-house laboratory. The company eventually wishes to use only one of the assays. It is in their interest to know which laboratory produces on average lower gold-content assays for a given sample.

The data in Figure 12 consist of 250 pairs of gold assays. Each point represents an observation xᵢ, with x_{ji} = log(1 + assay yield) for the ith assay pair for lab j, where j = 1 corresponds to the outside lab and j = 2 to the in-house lab. The log transformation stabilizes the variance and produces a more even scatter of points than the untransformed data. [There were many more small assays (<1 ounce (oz) per ton) than larger ones (>10 oz per ton).]

Our model for these data is

    x_{1i} = f(τᵢ) + e_{1i},
    x_{2i} = τᵢ + e_{2i},                                              (9)

where τᵢ is the expected gold content for sample i using the in-house lab assay, f(τᵢ) is the expected assay result for the outside lab relative to the in-house lab, and e_{ji} is measurement error, assumed iid with var(e_{1i}) = var(e_{2i}) ∀ i.

This is a generalization of the linear errors-in-variables model, the structural model (if we regard the τᵢ themselves as unobservable random variables), or the functional model (if the τᵢ are considered fixed):

    x_{1i} = α + βτᵢ + e_{1i},
    x_{2i} = τᵢ + e_{2i}.                                              (10)

Model (10) essentially looks for deviations from the 45° line, and is estimated by the first principal component.
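For concreteness, a minimal sketch of that estimate (ours): the first principal component of the centered pairs gives the errors-in-variables line, here via the singular value decomposition; `x` is assumed to be the n × 2 matrix of log assay pairs:

```python
import numpy as np

def eiv_line(x):
    """Errors-in-variables (total least squares) line: mean + t * direction."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[0]     # vt[0] is the leading principal-component direction
```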


Figure 12. (a) Plot of the Log Assays for the In-House and Outside Labs. The solid curve is the principal curve, the dotted curve the scatterplot smooth, and the dashed curve the 45° line. (b) A Band of 25 Bootstrap Curves. Each curve is the principal curve of a bootstrap sample. A bootstrap sample is obtained by randomly assigning errors to the principal curve for the original data (solid curve). The band of curves appears to be centered at the solid curve, indicating small bias. The spread of the curves gives an indication of variance. (c) Another Band of 25 Bootstrap Curves. Each curve is the principal curve of a bootstrap sample, based on the linear errors-in-variables regression line (solid line). This simulation tests the null hypothesis of no kink. There is evidence that the kink is real, since the principal curve (solid curve) lies outside this band in the region of the kink.
Model (9) is a special case of the principal-curve model, where one of the coordinate functions is the identity. This identifies the systematic component of variable x₂ with the arc-length parameter. Similarly, we estimate (9) using a natural variant of the principal-curve algorithm. In the smoothing step we smooth only x₁ against the current value of τ, and then update τ by projecting the data onto the curve defined by (f(τ), τ).
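A sketch of this variant (our illustration; `smooth` stands in for any scatterplot smoother, and the projection step is approximated by the nearest vertex of the fitted polygon):

```python
import numpy as np

def fit_partly_identity_curve(x1, x2, smooth, n_iter=10):
    """Fit model (9): smooth x1 against tau, then reproject onto (f(tau), tau)."""
    tau = x2.copy()                                  # identity coordinate starts it off
    for _ in range(n_iter):
        order = np.argsort(tau)
        f = np.empty_like(tau)
        f[order] = smooth(tau[order], x1[order])     # smoothing step: only x1 is smoothed
        grid, fgrid = tau[order], f[order]
        # projection step: update tau by the closest point on the curve (f(tau), tau)
        d = (x1[:, None] - fgrid) ** 2 + (x2[:, None] - grid) ** 2
        tau = grid[np.argmin(d, axis=1)]
    return tau, f
```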
The dotted curve in Figure 12 is the usual scatterplot smooth of x₁ against x₂ and is clearly misleading as a scatterplot summary. The principal curve lies above the 45° line in the interval 1.4-4, which represents an untransformed assay content interval of 3-15 oz/ton. In this interval the in-house assay tends to be lower than that of the outside lab. The difference is reversed at lower levels, but this is of less practical importance, since at these levels the lot is less valuable.

A natural question arising at this point is whether the bend in the curve is real, or whether the linear model (10) is adequate. If we had access to more data from the same population we could simply calculate the principal curves for the additional samples and see for how many of them the bend appeared.
In the absence of such additional samples, we use the bootstrap (Efron 1981, 1982) to simulate them. We compute the residual vectors of the observed data from the fitted curve in Figure 12a, and treating them as iid, we pool all 250 of them. Since these residuals are derived from a projection essentially onto a straight line, their expected squared length is half that of the residuals in Model (9). We therefore scale them up by a factor of √2. We then sampled with replacement from this pool, and reconstructed a bootstrap replicate by adding a sampled residual vector to each of the fitted values of the original fit. For each of these bootstrapped data sets the entire curve-fitting procedure was applied and the fitted curves were saved. This method of bootstrapping is aimed at exposing both bias and variance.
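A compact sketch of this scheme (ours; `fit` is a hypothetical stand-in that runs the errors-in-variables principal-curve procedure and returns the fitted values):

```python
import numpy as np

def bootstrap_curves(x, fit, n_boot=25, seed=0):
    """Residual bootstrap: pool residuals, scale by sqrt(2), refit replicates."""
    rng = np.random.default_rng(seed)
    fitted = fit(x)                                   # (n, 2) fitted values
    resid = np.sqrt(2.0) * (x - fitted)               # scaled-up residual pool
    curves = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(resid), len(resid)) # sample with replacement
        curves.append(fit(fitted + resid[idx]))       # replicate = fit + residuals
    return curves
```

The second experiment (Figure 12c) is the same scheme with `fitted` taken from the linear errors-in-variables line instead of the principal curve.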
Figure 12b shows the errors-in-variables principal curves obtained for 25 bootstrap samples. The spreads of these curves give an idea of the variance of the fitted curve. The difference between their average and the original fit estimates the bias, which in this case is negligible.

Figure 12c shows the result of a different bootstrap experiment. Our null hypothesis is that the relationship is linear, and thus we sampled in the same way as before but we replaced the principal curve with the linear errors-in-variables line. The observed curve (thick solid curve) lies outside the band of curves fitted to 25 bootstrapped data sets, providing additional evidence that the bend is indeed real.
8. EXTENSION TO HIGHER DIMENSIONS: PRINCIPAL SURFACES

We have had some success in extending the definitions and algorithms for curves to two-dimensional (globally parameterized) surfaces.

A continuous two-dimensional globally parameterized surface in Rᵖ is a function f : Λ → Rᵖ for Λ ⊂ R², where f is a vector of continuous functions:

    f(λ) = (f₁(λ₁, λ₂), . . . , f_p(λ₁, λ₂)).

Let X be defined as before, and let f denote a smooth two-dimensional surface in Rᵖ, parameterized over Λ ⊂ R². Here the projection index λ_f(x) is defined to be the parameter value corresponding to the point on the surface closest to x.

The principal surfaces of h are those members of §₂ that are self-consistent: E(X | λ_f(X) = λ) = f(λ) for a.e. λ. Figure 13 illustrates the situation. We do not yet have a rigorous justification for these definitions, although we have had success in implementing an algorithm.

The principal-surface algorithm is similar to the curve algorithm; two-dimensional surface smoothers are used instead of one-dimensional scatterplot smoothers. See Hastie (1984) for more details of principal surfaces, the algorithm to compute them, and examples.

Figure 13. Each point on a principal surface is the average of the points that project there.

9. DISCUSSION

Ours is not the first attempt at finding a method for fitting nonlinear manifolds to multivariate data. In discussing other approaches to the problem we restrict ourselves to one-dimensional manifolds (the case treated in this article).

The approach closest in spirit to ours was suggested by Carroll (1969). He fit a model of the form xᵢ = p(λᵢ) + eᵢ, where p(λ) is a vector of polynomials p_j(λ) = Σ_{k=0}^{K_j} a_{jk}λᵏ of prespecified degrees K_j. The goal is to find the coefficients of the polynomials and the λᵢ (i = 1, . . . , n) minimizing the loss function Σᵢ ‖eᵢ‖². The algorithm makes use of the fact that for given λ₁, . . . , λₙ, the optimal polynomial coefficients can be found by linear least squares, and the loss function thus can be written as a function of the λᵢ only.
Carroll gave an explicit formula for the gradient of the loss function, which is helpful in the n-dimensional numerical optimization required to find the optimal λ's.
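A rough sketch of this alternation (ours; for simplicity we update the λ's by a search over a fine grid rather than by Carroll's gradient formula):

```python
import numpy as np

def fit_polynomial_curve(x, degree=3, n_iter=20):
    """Alternate: linear LS for coefficients given lambdas; update lambdas given curve."""
    lam = x[:, 0].astype(float).copy()               # crude initial lambdas
    grid = np.linspace(lam.min(), lam.max(), 200)
    for _ in range(n_iter):
        basis = np.vander(lam, degree + 1)           # coefficients: least squares
        coef, *_ = np.linalg.lstsq(basis, x, rcond=None)
        curve = np.vander(grid, degree + 1) @ coef   # evaluate p(lambda) on the grid
        d = ((x[:, None, :] - curve[None, :, :]) ** 2).sum(axis=-1)
        lam = grid[np.argmin(d, axis=1)]             # lambdas: nearest curve point
    return coef, lam
```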
The model of Etezadi-Amoli and McDonald (1983) is the same as Carroll's, but they used different goodness-of-fit measures. Their goal was to minimize the off-diagonal elements of the error covariance matrix Σ = E′E, which is in the spirit of classical linear factor analysis. Various measures for the cumulative size of the off-diagonal elements are suggested, such as the sum of their squares. Their algorithm is similar to ours in that it alternates between improving the λ's for given polynomial coefficients and finding the optimal polynomial coefficients for given λ's. The latter is a linear least squares problem, whereas the former constitutes one step of a nonlinear optimization in n parameters.
Shepard and Carroll (1966) proceeded from the assumption that the p-dimensional observation vectors lie exactly on a smooth one-dimensional manifold. In this case, it is possible to find parameter values λ₁, . . . , λₙ such that for each one of the p coordinates, x_{ij} varies smoothly with λᵢ. The basis of their method is a measure for the degree of smoothness of the dependence of x_{ij} on λᵢ. This measure of smoothness, summed over the p coordinates, is then optimized with respect to the λ's: one finds those values of λ₁, . . . , λₙ that make the dependence of the coordinates on the λ's as smooth as possible.

We do not go into the definition and motivation of the smoothness measure; it is quite subtle, and we refer the interested reader to the original source. We just wish to point out that instead of optimizing smoothness, one could optimize a combination of smoothness and fidelity to the data as described in Section 5.5, which would lead to modeling the coordinate functions as spline functions and should allow the method to deal with noise in the data better.
In view of this previous work, what do we think is the contribution of the present article?

• From the operational point of view it is advantageous that there is no need to specify a parametric form for the coordinate functions. Because the curve is represented as a polygon, finding the optimal λ's for given coordinate functions is easy. This makes the alternating minimization attractive and allows fitting of principal curves to large data sets.
• From the theoretical point of view, the definition of principal curves as conditional expectations agrees with our mental image of a summary. The characterization of principal curves as critical points of the expected squared distance from the data makes them appear as a natural generalization of linear principal components. This close connection is further emphasized by the fact that linear principal curves are principal components, and that the algorithm converges to the largest principal component if conditional expectations are replaced by least squares straight lines.
APPENDIX: PROOFS OF PROPOSITIONS

We make the following assumptions: Denote by X a random vector in Rᵖ with density h and finite second moments. Let f denote a smooth (C∞) unit-speed curve in Rᵖ parameterized over a closed, possibly infinite interval Λ ⊂ R¹. We assume that f does not intersect itself [λ₁ ≠ λ₂ ⇒ f(λ₁) ≠ f(λ₂)] and has finite length inside any finite ball. Under these conditions, the set {f(λ), λ ∈ Λ} forms a smooth, connected one-dimensional manifold diffeomorphic to the interval Λ. Any smooth, connected one-dimensional manifold is diffeomorphic either to an interval or a circle (Milnor 1965). The results and proofs following could be slightly modified to cover the latter case (closed curves).

Existence of the Projection Index

Existence of the projection index is a consequence of the following two lemmas.

Lemma 5.1. For every x ∈ Rᵖ and for any r > 0, the set Q = {λ : ‖x − f(λ)‖ ≤ r} is compact.

Proof. Q is closed, because ‖x − f(λ)‖ is a continuous function of λ. It remains to show that Q is bounded. Suppose that it were not. Then, there would exist an unbounded monotone sequence λ₁, λ₂, . . . , with ‖x − f(λᵢ)‖ ≤ r. Let B denote the ball around x with radius 2r. Consider the segment of the curve between f(λᵢ) and f(λᵢ₊₁). The segment either leaves and reenters B, or it stays entirely inside. This means that it contributes at least min(2r, |λᵢ₊₁ − λᵢ|) to the length of the curve inside B. As there are infinitely many such segments, and the sequence {λᵢ} is unbounded, f would have infinite length in B, which is a contradiction.

Lemma 5.2. For every x ∈ Rᵖ, there exists λ ∈ Λ for which ‖x − f(λ)‖ = inf_{μ∈Λ} ‖x − f(μ)‖.

Proof. Define r = inf_{μ∈Λ} ‖x − f(μ)‖. Set B = {μ : ‖x − f(μ)‖ ≤ 2r}. Obviously, inf_{μ∈Λ} ‖x − f(μ)‖ = inf_{μ∈B} ‖x − f(μ)‖. Since B is nonempty and compact (Lemma 5.1), the infimum on the right side is attained.

Define d(x, f) = inf_{μ∈Λ} ‖x − f(μ)‖.

Proposition 5. The projection index λ_f(x) = sup{λ : ‖x − f(λ)‖ = d(x, f)} is well defined.

Proof. The set {λ : ‖x − f(λ)‖ = d(x, f)} is nonempty (Lemma 5.2) and compact (Lemma 5.1), and therefore has a largest element.

It is not hard to show that λ_f(x) is measurable; a proof is available on request.
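For a curve stored as a polygon, the projection index can be computed directly; the following sketch (ours, not from the paper) mirrors the definition, including the supremum over ties:

```python
import numpy as np

def projection_index(x, vertices):
    """Arc-length parameter of the closest point on a polygonal curve to x."""
    seg = np.diff(vertices, axis=0)
    lens = np.linalg.norm(seg, axis=1)
    starts = np.concatenate([[0.0], np.cumsum(lens)])[:-1]
    t = np.clip(np.einsum('ij,ij->i', x - vertices[:-1], seg) / lens**2, 0, 1)
    proj = vertices[:-1] + t[:, None] * seg
    d = np.linalg.norm(x - proj, axis=1)
    ties = np.flatnonzero(np.isclose(d, d.min()))
    return (starts[ties] + t[ties] * lens[ties]).max()   # sup over ambiguity ties
```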
Stationarity of the Distance Function

We first establish some simple facts that are of interest in themselves.

Lemma 6.1. If f(λ₀) is a closest point to x and λ₀ ∈ Λ⁰, the interior of the parameter interval, then x is in the normal hyperplane to f at f(λ₀): ⟨x − f(λ₀), f′(λ₀)⟩ = 0.

Proof. d‖x − f(λ)‖²/dλ = −2⟨x − f(λ), f′(λ)⟩. If f(λ₀) is a closest point and the derivative is defined (λ₀ ∈ Λ⁰), then it has to vanish.

Definition. A point x ∈ Rᵖ is called an ambiguity point for a curve f if it has more than one closest point on the curve: card{λ : ‖x − f(λ)‖ = d(x, f)} > 1.

Let A denote the set of ambiguity points. Our next goal is to show that A is measurable and has measure 0.

Define M_λ, the orthogonal hyperplane to f at λ, by M_λ = {x : ⟨x − f(λ), f′(λ)⟩ = 0}. Now, we know that if f(λ) is a closest point to x on the curve and λ ∈ Λ⁰, then x ∈ M_λ. It is useful to define a mapping that maps Λ × Rᵖ⁻¹ into ∪_λ M_λ. Choose p − 1 smooth vector fields n₁(λ), . . . , n_{p−1}(λ) such that for every λ the vectors f′(λ) and n₁(λ), . . . , n_{p−1}(λ) are orthogonal. It is well known that such vector fields do exist. Define χ : Λ × Rᵖ⁻¹ → Rᵖ by χ(λ, v) = f(λ) + Σᵢ₌₁^{p−1} vᵢnᵢ(λ), and set M = χ(Λ, Rᵖ⁻¹), the set of all points in Rᵖ lying in some hyperplane for some point on the curve. The mapping χ is smooth, because f and n₁, . . . , n_{p−1} are assumed to be smooth.

We now present a few observations that simplify showing that A has measure 0.

Lemma 6.2. μ(A ∩ Mᶜ) = 0.

Proof. Suppose that x ∈ A ∩ Mᶜ. According to Lemma 6.1, this is only possible if Λ is a finite closed interval [λ_min, λ_max] and x is equidistant from the endpoints f(λ_min) and f(λ_max). The set of all such points forms a hyperplane that has measure 0. Therefore A ∩ Mᶜ, as a subset of this measure-0 set, is measurable and has measure 0.

Lemma 6.3. Let E be a measure-0 set. It is sufficient to show that for every x ∈ Rᵖ∖E there exists an open neighborhood N(x) with μ(A ∩ N(x)) = 0.

Proof. The open covering {N(x) : x ∈ Rᵖ∖E} of Rᵖ∖E contains a countable covering {Nᵢ}, because the topology of Rᵖ has a countable base.

Lemma 6.4. We can restrict ourselves to the case of compact Λ.

Proof. Set Λₙ = Λ ∩ [−n, n], fₙ = f|Λₙ, and Aₙ as the set of ambiguity points of fₙ. Suppose that x is an ambiguity point of f; then {λ : ‖x − f(λ)‖ = d(x, f)} is compact (Lemma 5.1). Therefore, x ∈ Aₙ for some n, and A ⊂ ∪₁^∞ Aₙ.

We are now ready to prove Proposition 6.

Proposition 6. The set of ambiguity points has measure 0.

Proof. We can restrict ourselves to the case of compact Λ (Lemma 6.4). As μ(A ∩ Mᶜ) = 0 (Lemma 6.2) it is sufficient to show that for every x ∈ M, with the possible exception of a set C of measure 0, there exists a neighborhood N(x) with μ(A ∩ N(x)) = 0.

We choose C to be the set of critical values of χ. [A point y ∈ M is called a regular value if rank(χ′(x)) = p for all x ∈ χ⁻¹(y); otherwise y is called a critical value.] By Sard's theorem (Milnor 1965), C has measure 0.

Pick x ∈ M ∩ Cᶜ. We first show that χ⁻¹(x) is a finite set {(λ₁, v₁), . . . , (λ_k, v_k)}. Suppose that on the contrary there was an infinite set {(ξ₁, w₁), (ξ₂, w₂), . . .} with χ(ξᵢ, wᵢ) = x. By compactness of Λ and continuity of χ, there would exist a cluster point ξ₀ of {ξ₁, ξ₂, . . .} and a corresponding w₀ with χ(ξ₀, w₀) = x. On the other hand, x was assumed to be a regular value of χ, and thus χ would be a diffeomorphism between a neighborhood of (ξ₀, w₀) and a neighborhood of x. This is a contradiction.

Because x is a regular value, there are neighborhoods Lᵢ of (λᵢ, vᵢ) and a neighborhood N(x) such that χ is a diffeomorphism between Lᵢ and N. Actually, a stronger statement holds. We can find Ñ(x) ⊂ N(x), for which χ⁻¹(Ñ) ⊂ ∪ᵢ Lᵢ. Suppose that this were not the case. Then, there would exist a sequence x₁, x₂, . . . → x and corresponding (ξᵢ, wᵢ) ∉ ∪ᵢ₌₁ᵏ Lᵢ with χ(ξᵢ, wᵢ) = xᵢ. The set {ξ₁, ξ₂, . . .} has a cluster point ξ₀ ∉ ∪ᵢ Lᵢ, and by continuity χ(ξ₀, w₀) = x, which is a contradiction.

We have now shown that for y ∈ Ñ(x) there exists exactly one pair (λᵢ(y), vᵢ(y)) ∈ Lᵢ with χ(λᵢ(y), vᵢ(y)) = y, and λᵢ(y) is a smooth function of y. Define λ₀(y) = λ_min and λ_{k+1}(y) = λ_max. Set dᵢ(y) = ‖y − f(λᵢ(y))‖². A simple calculation using the chain rule and the fact that ⟨y − f(λᵢ(y)), f′(λᵢ(y))⟩ = 0 (Lemma 6.1) shows that grad(dᵢ(y)) = 2(y − f(λᵢ(y))). A point y ∈ Ñ(x) can be an ambiguity point only if y ∈ Aᵢⱼ for some i ≠ j, where Aᵢⱼ = {z ∈ Ñ(x) : dᵢ(z) = dⱼ(z), λᵢ(z) ≠ λⱼ(z)}. Nevertheless, for λᵢ(z) ≠ λⱼ(z), grad(dᵢ(z) − dⱼ(z)) ≠ 0, because the curve f(λ) was assumed not to intersect itself. Thus Aᵢⱼ is a smooth, possibly not connected manifold of dimension p − 1, which has measure 0, and μ(A ∩ Ñ(x)) ≤ Σᵢ≠ⱼ μ(Aᵢⱼ) = 0.

We have glossed over a technicality: Sard's theorem requires χ to be defined on an open set. Nevertheless, we can always extend f in a smooth way beyond the boundaries of the interval.

In the following, let §_B denote the class of smooth curves g parameterized over Λ, with ‖g(λ)‖ ≤ 1 and ‖g′(λ)‖ ≤ 1. For g ∈ §_B, define f_t(λ) = f(λ) + tg(λ). It is easy to see that f_t has finite length inside any finite ball and that, for t < 1, λ_{f_t} is well defined. Moreover, we have the following lemma.

Lemma 4.1. If x is not an ambiguity point for f, then lim_{t↓0} λ_{f_t}(x) = λ_f(x).

Proof. We have to show that for every ε > 0 there exists δ > 0 such that for all t < δ, |λ_{f_t}(x) − λ_f(x)| < ε. Set C = Λ ∩ (λ_f(x) − ε, λ_f(x) + ε)ᶜ and d_C = inf_{λ∈C} ‖x − f(λ)‖. The infimum is attained and d_C > ‖x − f(λ_f(x))‖, because x is not an ambiguity point. Set δ = ⅓(d_C − ‖x − f(λ_f(x))‖). Now, λ_{f_t}(x) ∈ (λ_f(x) − ε, λ_f(x) + ε) ∀ t < δ, because

    inf_{λ∈C} (‖x − f_t(λ)‖ − ‖x − f_t(λ_f(x))‖) ≥ d_C − δ − ‖x − f(λ_f(x))‖ − δ = δ > 0.

Proof of Proposition 4. The curve f is a principal curve of h iff

    (d/dt) D²(h, f_t) |_{t=0} = 0   ∀ g ∈ §_B.

We use the dominated convergence theorem to show that we can interchange the orders of integration and differentiation in the expression

    (d/dt) D²(h, f_t) = (d/dt) E_h‖X − f_t(λ_{f_t}(X))‖².              (A.1)

We need to find a pair of integrable random variables that almost surely bound

    Z_t = [‖X − f_t(λ_{f_t}(X))‖² − ‖X − f(λ_f(X))‖²] / t

for all sufficiently small t > 0.

Now,

    Z_t ≤ [‖X − f_t(λ_f(X))‖² − ‖X − f(λ_f(X))‖²] / t.

Expanding the first norm we get

    ‖X − f_t(λ_f(X))‖² = ‖X − f(λ_f(X))‖² + t²‖g(λ_f(X))‖² − 2t⟨X − f(λ_f(X)), g(λ_f(X))⟩,

and thus

    Z_t ≤ −2⟨X − f(λ_f(X)), g(λ_f(X))⟩ + t‖g(λ_f(X))‖².               (A.2)

Using the Cauchy-Schwarz inequality and the assumption that ‖g‖ ≤ 1, Z_t ≤ 2‖X − f(λ_f(X))‖ + 1 ≤ 2‖X − f(λ₀)‖ + 1 for all t < 1 and arbitrary λ₀. As ‖X‖ was assumed to be integrable, so is ‖X − f(λ₀)‖, and therefore Z_t.

Similarly, we have

    Z_t ≥ [‖X − f_t(λ_{f_t}(X))‖² − ‖X − f(λ_{f_t}(X))‖²] / t.

Expanding the first norm as before, we get

    Z_t ≥ −2⟨X − f(λ_{f_t}(X)), g(λ_{f_t}(X))⟩ ≥ −2‖X − f(λ_{f_t}(X))‖ ≥ −2‖X − f(λ₀)‖,    (A.3)

which is once again integrable. By the dominated convergence theorem, the interchange is justified. From (A.1) and (A.2), and because f and g are continuous functions, we see that the limit lim_{t↓0} Z_t exists whenever λ_{f_t}(X) is continuous in t at t = 0. We have proved this continuity for a.e. x in Lemma 4.1. Moreover, this limit is given by lim_{t↓0} Z_t = −2⟨X − f(λ_f(X)), g(λ_f(X))⟩, by (A.1) and (A.2).
Denoting the distribution of λ_f(X) by h_λ, we get

    (d/dt) D²(h, f_t) |_{t=0} = −2E_{h_λ}[(E(X | λ_f(X) = λ) − f(λ)) · g(λ)].    (A.4)

If f(λ) is a principal curve of h, then by definition E(X | λ_f(X) = λ) = f(λ) for a.e. λ, and thus

    (d/dt) D²(h, f_t) |_{t=0} = 0   ∀ g ∈ §_B.

Conversely, suppose that

    E_{h_λ}[E(X − f(λ) | λ_f(X) = λ) · g(λ)] = 0,                      (A.5)

for all g ∈ §_B. Consider each coordinate separately, and reexpress (A.5) as

    E_{h_λ} k(λ)g(λ) = 0   ∀ g ∈ §_B.                                  (A.6)

This implies that k(λ) = 0 a.s.

Construction of Densities With Known Principal Curves

Let f be parameterized over a compact interval Λ. It is easy to construct densities with a carrier in a tube around f, for which f is a principal curve.

Denote by B_r the ball in Rᵖ⁻¹ with radius r and center at the origin. The construction is based on the following proposition.

Proposition 7. If Λ is compact, there exists r > 0 such that χ | Λ × B_r is a diffeomorphism.

Proof. Suppose that the result were not true. Pick a sequence rᵢ → 0. There would exist sequences (λᵢ, vᵢ) ≠ (ξᵢ, wᵢ), with ‖vᵢ‖ < rᵢ, ‖wᵢ‖ < rᵢ, and χ(λᵢ, vᵢ) = χ(ξᵢ, wᵢ). The sequences λᵢ and ξᵢ have cluster points λ₀ and ξ₀. We must have λ₀ = ξ₀, because

    f(λ₀) = χ(λ₀, 0) = lim χ(λᵢ, vᵢ) = lim χ(ξᵢ, wᵢ) = χ(ξ₀, 0) = f(ξ₀),

and by assumption f does not intersect itself. So there would be sequences (λᵢ, vᵢ) and (ξᵢ, wᵢ) converging to (λ₀, 0), with χ(λᵢ, vᵢ) = χ(ξᵢ, wᵢ). Nevertheless, it is easy to see that (λ₀, 0) is a regular point of χ, and thus χ maps a neighborhood of (λ₀, 0) diffeomorphically into a neighborhood of f(λ₀), which is a contradiction.

Define T(f, r) = χ(Λ × B_r). Proposition 7 assures that there are no ambiguity points in T(f, r) and λ_f(x) = λ for x ∈ χ(λ, B_r).

Pick a density ζ(λ) on Λ and a density ψ(v) on B_r with ∫_{B_r} vψ(v) dv = 0. The mapping χ carries the product density ζ(λ)ψ(v) on Λ × B_r into a density h(x) on T(f, r). It is easy to verify that f is a principal curve for h.

[Received December 1984. Revised December 1988.]

REFERENCES

Anderson, T. W. (1982), "Estimating Linear Structural Relationships," Technical Report 389, Stanford University, Institute for Mathematical Studies in the Social Sciences.
Becker, R., Chambers, J., and Wilks, A. (1988), The New S Language, New York: Wadsworth.
Carroll, D. J. (1969), "Polynomial Factor Analysis," in Proceedings of the 77th Annual Convention, Arlington, VA: American Psychological Association, pp. 103-104.
Cleveland, W. S. (1979), "Robust Locally Weighted Regression and Smoothing Scatterplots," Journal of the American Statistical Association, 74, 829-836.
Efron, B. (1981), "Non-parametric Standard Errors and Confidence Intervals," Canadian Journal of Statistics, 9, 139-172.
——— (1982), The Jackknife, the Bootstrap, and Other Resampling Plans (CBMS-NSF Regional Conference Series in Applied Mathematics, No. 38), Philadelphia: Society for Industrial and Applied Mathematics.
Etezadi-Amoli, J., and McDonald, R. P. (1983), "A Second Generation Nonlinear Factor Analysis," Psychometrika, 48, 315-342.
Golub, G. H., and Van Loan, C. (1979), "Total Least Squares," in Smoothing Techniques for Curve Estimation, Heidelberg: Springer-Verlag, pp. 69-76.
Hart, J., and Wehrly, T. (1986), "Kernel Regression Estimation Using Repeated Measurement Data," Journal of the American Statistical Association, 81, 1080-1088.
Hastie, T. J. (1984), "Principal Curves and Surfaces," Laboratory for Computational Statistics Technical Report 11, Stanford University, Dept. of Statistics.
Milnor, J. W. (1965), Topology From the Differentiable Viewpoint, Charlottesville: University of Virginia Press.
Shepard, R. N., and Carroll, D. J. (1966), "Parametric Representations of Non-linear Data Structures," in Multivariate Analysis, ed. P. R. Krishnaiah, New York: Academic Press, pp. 561-592.
Silverman, B. W. (1985), "Some Aspects of Spline Smoothing Approaches to Non-parametric Regression Curve Fitting," Journal of the Royal Statistical Society, Ser. B, 47, 1-52.
Stone, M. (1974), "Cross-validatory Choice and Assessment of Statistical Predictions" (with discussion), Journal of the Royal Statistical Society, Ser. B, 36, 111-147.
Thorpe, J. A. (1979), Elementary Topics in Differential Geometry, New York: Springer-Verlag.
Wahba, G., and Wold, S. (1975), "A Completely Automatic French Curve: Fitting Spline Functions by Cross-Validation," Communications in Statistics, 4, 1-7.
Watson, G. S. (1964), "Smooth Regression Analysis," Sankhya, Ser. A, 26, 359-372.

More Related Content

What's hot

Computing transformations
Computing transformationsComputing transformations
Computing transformationsTarun Gehlot
 
PRML Chapter 4
PRML Chapter 4PRML Chapter 4
PRML Chapter 4Sunwoo Kim
 
Bayesian approximation techniques of Topp-Leone distribution
Bayesian approximation techniques of Topp-Leone distributionBayesian approximation techniques of Topp-Leone distribution
Bayesian approximation techniques of Topp-Leone distributionPremier Publishers
 
83662164 case-study-1
83662164 case-study-183662164 case-study-1
83662164 case-study-1homeworkping3
 
Visual Explanation of Ridge Regression and LASSO
Visual Explanation of Ridge Regression and LASSOVisual Explanation of Ridge Regression and LASSO
Visual Explanation of Ridge Regression and LASSOKazuki Yoshida
 
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic NetsData Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic NetsDerek Kane
 
Application of Weighted Least Squares Regression in Forecasting
Application of Weighted Least Squares Regression in ForecastingApplication of Weighted Least Squares Regression in Forecasting
Application of Weighted Least Squares Regression in Forecastingpaperpublications3
 
Introduction to Principle Component Analysis
Introduction to Principle Component AnalysisIntroduction to Principle Component Analysis
Introduction to Principle Component AnalysisSunjeet Jena
 
Chapter18 econometrics-sure models
Chapter18 econometrics-sure modelsChapter18 econometrics-sure models
Chapter18 econometrics-sure modelsMilton Keynes Keynes
 
Sparse Observability using LP Presolve and LTDL Factorization in IMPL (IMPL-S...
Sparse Observability using LP Presolve and LTDL Factorization in IMPL (IMPL-S...Sparse Observability using LP Presolve and LTDL Factorization in IMPL (IMPL-S...
Sparse Observability using LP Presolve and LTDL Factorization in IMPL (IMPL-S...Alkis Vazacopoulos
 
A comparative analysis of predictve data mining techniques4
A comparative analysis of predictve data mining techniques4A comparative analysis of predictve data mining techniques4
A comparative analysis of predictve data mining techniques4Mintu246
 
Mva 06 principal_component_analysis_2010_11
Mva 06 principal_component_analysis_2010_11Mva 06 principal_component_analysis_2010_11
Mva 06 principal_component_analysis_2010_11P Palai
 
Adesanya dissagregation of data corrected
Adesanya dissagregation of data correctedAdesanya dissagregation of data corrected
Adesanya dissagregation of data correctedAlexander Decker
 
Estimation theory 1
Estimation theory 1Estimation theory 1
Estimation theory 1Gopi Saiteja
 
Distributed lag model
Distributed lag modelDistributed lag model
Distributed lag modelPawan Kawan
 
Presentation on GMM
Presentation on GMMPresentation on GMM
Presentation on GMMMoses sichei
 
Conformal field theories and three point functions
Conformal field theories and three point functionsConformal field theories and three point functions
Conformal field theories and three point functionsSubham Dutta Chowdhury
 

What's hot (18)

Computing transformations
Computing transformationsComputing transformations
Computing transformations
 
PRML Chapter 4
PRML Chapter 4PRML Chapter 4
PRML Chapter 4
 
Bayesian approximation techniques of Topp-Leone distribution
Bayesian approximation techniques of Topp-Leone distributionBayesian approximation techniques of Topp-Leone distribution
Bayesian approximation techniques of Topp-Leone distribution
 
83662164 case-study-1
83662164 case-study-183662164 case-study-1
83662164 case-study-1
 
Visual Explanation of Ridge Regression and LASSO
Visual Explanation of Ridge Regression and LASSOVisual Explanation of Ridge Regression and LASSO
Visual Explanation of Ridge Regression and LASSO
 
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic NetsData Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
 
Application of Weighted Least Squares Regression in Forecasting
Application of Weighted Least Squares Regression in ForecastingApplication of Weighted Least Squares Regression in Forecasting
Application of Weighted Least Squares Regression in Forecasting
 
Introduction to Principle Component Analysis
Introduction to Principle Component AnalysisIntroduction to Principle Component Analysis
Introduction to Principle Component Analysis
 
Chapter18 econometrics-sure models
Chapter18 econometrics-sure modelsChapter18 econometrics-sure models
Chapter18 econometrics-sure models
 
Sparse Observability using LP Presolve and LTDL Factorization in IMPL (IMPL-S...
Sparse Observability using LP Presolve and LTDL Factorization in IMPL (IMPL-S...Sparse Observability using LP Presolve and LTDL Factorization in IMPL (IMPL-S...
Sparse Observability using LP Presolve and LTDL Factorization in IMPL (IMPL-S...
 
A comparative analysis of predictve data mining techniques4
A comparative analysis of predictve data mining techniques4A comparative analysis of predictve data mining techniques4
A comparative analysis of predictve data mining techniques4
 
Mx2421262131
Mx2421262131Mx2421262131
Mx2421262131
 
Mva 06 principal_component_analysis_2010_11
Mva 06 principal_component_analysis_2010_11Mva 06 principal_component_analysis_2010_11
Mva 06 principal_component_analysis_2010_11
 
Adesanya dissagregation of data corrected
Adesanya dissagregation of data correctedAdesanya dissagregation of data corrected
Adesanya dissagregation of data corrected
 
Estimation theory 1
Estimation theory 1Estimation theory 1
Estimation theory 1
 
Distributed lag model
Distributed lag modelDistributed lag model
Distributed lag model
 
Presentation on GMM
Presentation on GMMPresentation on GMM
Presentation on GMM
 
Conformal field theories and three point functions
Conformal field theories and three point functionsConformal field theories and three point functions
Conformal field theories and three point functions
 

Similar to Principa

How to draw a good graph
How to draw a good graphHow to draw a good graph
How to draw a good graphTarun Gehlot
 
Eigen value and eigen vectors shwetak
Eigen value and eigen vectors shwetakEigen value and eigen vectors shwetak
Eigen value and eigen vectors shwetakMrsShwetaBanait1
 
[Xin yan, xiao_gang_su]_linear_regression_analysis(book_fi.org)
[Xin yan, xiao_gang_su]_linear_regression_analysis(book_fi.org)[Xin yan, xiao_gang_su]_linear_regression_analysis(book_fi.org)
[Xin yan, xiao_gang_su]_linear_regression_analysis(book_fi.org)mohamedchaouche
 
Forecasting With An Adaptive Control Algorithm
Forecasting With An Adaptive Control AlgorithmForecasting With An Adaptive Control Algorithm
Forecasting With An Adaptive Control Algorithmshwetakarsh
 
Maths A - Chapter 11
Maths A - Chapter 11Maths A - Chapter 11
Maths A - Chapter 11westy67968
 
Linear regression [Theory and Application (In physics point of view) using py...
Linear regression [Theory and Application (In physics point of view) using py...Linear regression [Theory and Application (In physics point of view) using py...
Linear regression [Theory and Application (In physics point of view) using py...ANIRBANMAJUMDAR18
 
Advance control theory
Advance control theoryAdvance control theory
Advance control theorySHIMI S L
 
Spline (Interpolation)
Spline (Interpolation)Spline (Interpolation)
Spline (Interpolation)Pallab Jana
 
Use of eigenvalues and eigenvectors to analyze bipartivity of network graphs
Use of eigenvalues and eigenvectors to analyze bipartivity of network graphsUse of eigenvalues and eigenvectors to analyze bipartivity of network graphs
Use of eigenvalues and eigenvectors to analyze bipartivity of network graphscsandit
 
Oscar Nieves (11710858) Computational Physics Project - Inverted Pendulum
Oscar Nieves (11710858) Computational Physics Project - Inverted PendulumOscar Nieves (11710858) Computational Physics Project - Inverted Pendulum
Oscar Nieves (11710858) Computational Physics Project - Inverted PendulumOscar Nieves
 
Factor anaysis scale dimensionality
Factor anaysis scale dimensionalityFactor anaysis scale dimensionality
Factor anaysis scale dimensionalityCarlo Magno
 
Power System State Estimation Using Weighted Least Squares (WLS) and Regulari...
Power System State Estimation Using Weighted Least Squares (WLS) and Regulari...Power System State Estimation Using Weighted Least Squares (WLS) and Regulari...
Power System State Estimation Using Weighted Least Squares (WLS) and Regulari...IJERA Editor
 
Quantum Statistical Geometry #2
Quantum Statistical Geometry #2Quantum Statistical Geometry #2
Quantum Statistical Geometry #2Dann Passoja
 

Similar to Principa (20)

How to draw a good graph
How to draw a good graphHow to draw a good graph
How to draw a good graph
 
Eigen value and eigen vectors shwetak
Eigen value and eigen vectors shwetakEigen value and eigen vectors shwetak
Eigen value and eigen vectors shwetak
 
Unit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdfUnit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdf
 
Unit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdfUnit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdf
 
Unit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdfUnit-3 Data Analytics.pdf
Unit-3 Data Analytics.pdf
 
[Xin yan, xiao_gang_su]_linear_regression_analysis(book_fi.org)
[Xin yan, xiao_gang_su]_linear_regression_analysis(book_fi.org)[Xin yan, xiao_gang_su]_linear_regression_analysis(book_fi.org)
[Xin yan, xiao_gang_su]_linear_regression_analysis(book_fi.org)
 
Forecasting With An Adaptive Control Algorithm
Forecasting With An Adaptive Control AlgorithmForecasting With An Adaptive Control Algorithm
Forecasting With An Adaptive Control Algorithm
 
Maths A - Chapter 11
Maths A - Chapter 11Maths A - Chapter 11
Maths A - Chapter 11
 
Linear regression [Theory and Application (In physics point of view) using py...
Linear regression [Theory and Application (In physics point of view) using py...Linear regression [Theory and Application (In physics point of view) using py...
Linear regression [Theory and Application (In physics point of view) using py...
 
Advance control theory
Advance control theoryAdvance control theory
Advance control theory
 
Spline (Interpolation)
Spline (Interpolation)Spline (Interpolation)
Spline (Interpolation)
 
Scatter diagram
Scatter diagramScatter diagram
Scatter diagram
 
Canonical correlation
Canonical correlationCanonical correlation
Canonical correlation
 
Use of eigenvalues and eigenvectors to analyze bipartivity of network graphs
Use of eigenvalues and eigenvectors to analyze bipartivity of network graphsUse of eigenvalues and eigenvectors to analyze bipartivity of network graphs
Use of eigenvalues and eigenvectors to analyze bipartivity of network graphs
 
Oscar Nieves (11710858) Computational Physics Project - Inverted Pendulum
Oscar Nieves (11710858) Computational Physics Project - Inverted PendulumOscar Nieves (11710858) Computational Physics Project - Inverted Pendulum
Oscar Nieves (11710858) Computational Physics Project - Inverted Pendulum
 
Www.cs.berkeley.edu kunal
Www.cs.berkeley.edu kunalWww.cs.berkeley.edu kunal
Www.cs.berkeley.edu kunal
 
Factor anaysis scale dimensionality
Factor anaysis scale dimensionalityFactor anaysis scale dimensionality
Factor anaysis scale dimensionality
 
07 Tensor Visualization
07 Tensor Visualization07 Tensor Visualization
07 Tensor Visualization
 
Power System State Estimation Using Weighted Least Squares (WLS) and Regulari...
Power System State Estimation Using Weighted Least Squares (WLS) and Regulari...Power System State Estimation Using Weighted Least Squares (WLS) and Regulari...
Power System State Estimation Using Weighted Least Squares (WLS) and Regulari...
 
Quantum Statistical Geometry #2
Quantum Statistical Geometry #2Quantum Statistical Geometry #2
Quantum Statistical Geometry #2
 

Principa

  • 1. Principal Curves Author(s): Trevor Hastie and Werner Stuetzle Source: Journal of the American Statistical Association, Vol. 84, No. 406 ( J u n . , 1989), p p . 502- 516 Published by: American Statistical Association Stable URL: http://www.jstor.org/stable/2289936 Accessed: 24/09/2009 14:52 Your use of the JSTOR archive indicates your acceptance of JSTOR's Terms and Conditions of Use, available at http://www.jstor.org/page/info/about/policies/terms.jsp. JSTOR's Terms and Conditions of Use provides, in part, that unless you have obtained prior permission, you may not download an entire issue of a journal or multiple copies of articles, and you may use content in the JSTOR archive only for your personal, non-commercial use. Please contact the publisher regarding any further use of this work. Publisher contact information may be obtained at http://www.jstor.org/action/showPublisher?publisherCode=astata. Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printed page of such transmission. JSTOR is a not-for-profit organization founded in 1995 to build trusted digital archives for scholarship. We work with the scholarly community to preserve their work and the materials they rely upon, and to build a common research platform that promotes the discovery and use of these resources. For more information about JSTOR, please contact support@jstor.org. American Statistical Association is collaborating with JSTOR to digitize, preserve and extend access to Journal of the American Statistical Association. http://www.jstor.org
  • 2. Principal Curves TREVOR HASTIE a n d WERNER STUETZLE* Principal curves are smooth one-dimensional curves that pass through the middle of a p-dimensional data set, providing a nonlinear summary of the data. They are nonparametric, and their shape is suggested by the data. The algorithm for constructing principal curves starts with some prior summary, such as the usual principal-component line. The curve in each successive iteration is a smooth or local average of the p-dimensional points, where the definition of local is based on the distance in arc length of the projections of the points onto the curve found in the previous iteration. In this article principal curves are defined, an algorithm for their construction is given, some theoretical results are presented, and the procedure is compared to other generalizations of principal components. Two applications illustrate the use of principal curves. The first describes how the principal-curve procedure was used to align the magnets of the Stanford linear collider. The collider uses about 950 magnets in a roughly circular arrangement to bend electron and positron beams and bring them to collision. After construction, it was found that some of the magnets had ended up significantly out of place. As a result, the beams had to be bent too sharply and could not be focused. The engineers realized that the magnets did not have to be moved to their originally planned locations, but rather to a sufficiently smooth arc through the middle of the existing positions. This arc was found using the principal- curve procedure. In the second application, two different assays for gold content in several samples of computer-chip waste appear to show some systematic differences that are blurred by measurement error. The classical approach using linear errors in variables regression can detect systematic linear differences but is not able to account for nonlinearities. When the first linear principal component is replaced with a principal curve, a local "bump" is revealed, and bootstrapping is used to verify its presence. KEY WORDS: Errors in variables; Principal components; Self-consistency; Smoother; Symmetric. 1. INTRODUCTION component line in Figure lb does just this—it is found by minimizing the orthogonal deviations. Consider a data set consisting of n observations on two Linear regression has been generalized to include non- variables, x and y. We can represent the n points in a linear functions of x. This has been achieved using scatterplot, as in Figure la. It is natural to try and sum- predefined parametric functions, and with the reduced marize the pattern exhibited by the points in the scatter- cost and increased speed of computing nonparametric plot. The type of summary we choose depends on the goal scatterplot smoothers have gained popularity. These of our analysis; a trivial summary is the mean vector that include kernel smoothers (Watson 1964), nearest-neighbor simply locates the center of the cloud but conveys no in- smoothers (Cleveland 1979), and spline smoothers (Sil- formation about the joint behavior of the two variables. verman 1985). In general, scatterplot smoothers produce It is often sensible to treat one of the variables as a a curve that attempts to minimize the vertical deviations response variable and the other as an explanatory variable. (as depicted in Fig. lc), subject to some form of smooth- Hence the aim of the analysis is to seek a rule for predicting ness constraint. 
The nonparametric versions referred to before allow the data to dictate the form of the nonlinear dependency.

We consider similar generalizations for the symmetric situation. Instead of summarizing the data with a straight line, we use a smooth curve; in finding the curve we treat the two variables symmetrically. Such curves pass through the middle of the data in a smooth way, whether or not the middle of the data is a straight line. This situation is depicted in Figure 1d. These curves, like linear principal components, focus on the orthogonal or shortest distance to the points. We formally define principal curves to be those smooth curves that are self-consistent for a distribution or data set. This means that if we pick any point on the curve, collect all of the data that project onto this point, and average them, then this average coincides with the point on the curve.

The algorithm for finding principal curves is equally intuitive. Starting with any smooth curve (usually the largest principal component), it checks if this curve is self-consistent by projecting and averaging. If it is not, the procedure is repeated, using the new curve obtained by averaging as a starting guess. This is iterated until (hopefully) convergence.

* Trevor Hastie is Member of Technical Staff, AT&T Bell Laboratories, Murray Hill, NJ 07974. Werner Stuetzle is Associate Professor, Department of Statistics, University of Washington, Seattle, WA 98195. This work was developed for the most part at Stanford University, with partial support from U.S. Department of Energy Contracts DE-AC03-76SF and DE-AT03-81-ER10843, U.S. Office of Naval Research Contract N00014-81-K-0340, and U.S. Army Research Office Contract DAAG29-82-K-0056. The authors thank Andreas Buja, Tom Duchamp, Iain Johnstone, and Larry Shepp for their theoretical support, Robert Tibshirani, Brad Efron, and Jerry Friedman for many helpful discussions and suggestions, Horst Friedsam and Will Oren for supplying the Stanford linear collider example and their help with the analysis, and both referees for their constructive criticism of earlier drafts.
Figure 1. (a) The linear regression line minimizes the sum of squared deviations in the response variable. (b) The principal-component line minimizes the sum of squared deviations in all of the variables. (c) The smooth regression curve minimizes the sum of squared deviations in the response variable, subject to smoothness constraints. (d) The principal curve minimizes the sum of squared deviations in all of the variables, subject to smoothness constraints.

The largest principal-component line plays roles other than that of a data summary:

1. In errors-in-variables regression it is assumed that there is randomness in the predictors as well as the response. This can occur in practice when the predictors are measurements of some underlying variables and there is error in the measurements; it also occurs in observational studies where neither variable is fixed by design. The errors-in-variables regression technique models the expectation of y as a linear function of the systematic component of x. In the case of a single predictor, the model is estimated by the principal-component line. This is also the total least squares method of Golub and Van Loan (1979). More details are given in an example in Section 7.2.

2. Often we want to replace several highly correlated variables with a single variable, such as a normalized linear combination of the original set. The first principal component is the normalized linear combination with the largest variance.

3. In factor analysis we model the systematic component of the data by linear functions of a small set of unobservable variables called factors. Often the models are estimated using linear principal components; in the case of one factor [Eq. (1), as follows] one could use the largest principal component. Many variations of this model have appeared in the literature.

In all the previous situations the model can be written as

$$\mathbf{x}_i = \mathbf{u}_0 + \mathbf{a}\lambda_i + \mathbf{e}_i, \qquad (1)$$

where u_0 + aλ_i is the systematic component and e_i is the random component. If we assume that cov(e_i) = σ²I, then the least squares estimate of a is the first linear principal component.

A natural generalization of (1) is the nonlinear model

$$\mathbf{x}_i = \mathbf{f}(\lambda_i) + \mathbf{e}_i. \qquad (2)$$

This might then be a factor analysis or structural model, and for two variables and some restrictions an errors-in-variables regression model. In the same spirit as before, where we used the first linear principal component to estimate (1), the techniques described in this article can be used to estimate the systematic component in (2).

We focus on the definition of principal curves and an algorithm for finding them. We also present some theoretical results, although many open questions remain.
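As a concrete illustration of model (2), the following minimal Python sketch (ours, not part of the original article; the curve, sample size, and noise level are illustrative assumptions) draws data x_i = f(λ_i) + e_i with f an arc of a circle:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    lam = rng.uniform(0.0, np.pi, n)                 # latent parameter values
    f = np.column_stack([np.cos(lam), np.sin(lam)])  # systematic component f(lambda)
    e = rng.normal(scale=0.1, size=(n, 2))           # random component e_i
    x = f + e                                        # observed two-dimensional data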
2. THE PRINCIPAL CURVES OF A PROBABILITY DISTRIBUTION

We first give a brief introduction to one-dimensional curves, and then define the principal curves of smooth probability distributions in p space. Subsequent sections give algorithms for finding the curves, both for distributions and finite realizations. This is analogous to motivating a scatterplot smoother, such as a moving average or kernel smoother, as an estimator for the conditional expectation of the underlying distribution. We also briefly discuss an alternative approach via regularization using smoothing splines.

2.1 One-Dimensional Curves

A one-dimensional curve in p-dimensional space is a vector f(λ) of p functions of a single variable λ. These functions are called the coordinate functions, and λ provides an ordering along the curve. If the coordinate functions are smooth, then f is by definition a smooth curve. We can apply any monotone transformation to λ, and by modifying the coordinate functions appropriately the curve remains unchanged; the parameterization, however, is different. There is a natural parameterization for curves in terms of arc length. The arc length of a curve f from λ_0 to λ_1 is given by

$$l = \int_{\lambda_0}^{\lambda_1} \|\mathbf{f}'(z)\|\, dz.$$

If ||f'(z)|| ≡ 1, then l = λ_1 − λ_0. This is a desirable situation, since if all of the coordinate variables are in the same units of measurement, then λ is also in those units.

The vector f'(λ) is tangent to the curve at λ and is sometimes called the velocity vector at λ. A curve with ||f'|| ≡ 1 is called a unit-speed parameterized curve. We can always reparameterize any smooth curve with ||f'|| > 0 to make it unit speed. In addition, our intuitive concept of smoothness relates more naturally to unit-speed curves: for a unit-speed curve, smoothness of the coordinate functions translates directly into smooth visual appearance of the point set {f(λ), λ ∈ Λ} (absence of sharp bends).

If v is a unit vector, then f(λ) = v_0 + λv is a unit-speed straight line. This parameterization is not unique: f*(λ) = u + av + λv is another unit-speed parameterization for the same line.
In the following we always assume that ⟨u, v⟩ = 0.

The vector f''(λ) is called the acceleration of the curve at λ, and for a unit-speed curve it is easy to check that it is orthogonal to the tangent vector. In this case f''/||f''|| is called the principal normal to the curve at λ. The vectors f'(λ) and f''(λ) span a plane. There is a unique unit-speed circle in the plane that goes through f(λ) and has the same velocity and acceleration at f(λ) as the curve itself (see Fig. 2). The radius r_f(λ) of this circle is called the radius of curvature of the curve f at λ; it is easy to see that r_f(λ) = 1/||f''(λ)||. The center c_f(λ) of the circle is called the center of curvature of f at λ. Thorpe (1979) gave a clear introduction to these and related ideas in differential geometry.

Figure 2. The radius of curvature is the radius of the circle tangent to the curve with the same acceleration as the curve.

2.2 Definition of Principal Curves

Denote by X a random vector in R^p with density h and finite second moments. Without loss of generality, assume E(X) = 0. Let f denote a smooth (C^∞) unit-speed curve in R^p parameterized over Λ ⊂ R^1, a closed (possibly infinite) interval, that does not intersect itself (λ_1 ≠ λ_2 ⇒ f(λ_1) ≠ f(λ_2)) and has finite length inside any finite ball in R^p.

We define the projection index λ_f: R^p → R^1 as

$$\lambda_{\mathbf{f}}(\mathbf{x}) = \sup_{\lambda}\{\lambda : \|\mathbf{x} - \mathbf{f}(\lambda)\| = \inf_{\mu} \|\mathbf{x} - \mathbf{f}(\mu)\|\}. \qquad (3)$$

The projection index λ_f(x) of x is the value of λ for which f(λ) is closest to x. If there are several such values, we pick the largest one. We show in the Appendix that λ_f(x) is well defined and measurable.

Definition 1. The curve f is called self-consistent or a principal curve of h if E(X | λ_f(X) = λ) = f(λ) for a.e. λ.

Figure 3 illustrates the intuitive motivation behind our definition of a principal curve. For any particular parameter value λ we collect all of the observations that have f(λ) as their closest point on the curve. If f(λ) is the average of those observations, and if this holds for all λ, then f is called a principal curve. In the figure we have actually averaged observations projecting into a neighborhood on the curve. This gives the flavor of our data algorithms to come; we need to do some kind of local averaging to estimate conditional expectations.

The definition of principal curves immediately gives rise to several interesting questions: For what kinds of distributions do principal curves exist, how many different principal curves are there for a given distribution, and what are their properties? We are unable to answer those questions in general. We can, however, show that the definition is not vacuous, and that there are densities that do have principal curves.
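Definition 1 can be checked numerically for a curve stored as a fine polygon. The sketch below is ours (all function names are hypothetical); binning the projections in λ stands in for the conditioning in E(X | λ_f(X) = λ). It projects each observation to its nearest stored curve point, groups the projections into small λ-bins, and compares the bin averages with the curve:

    import numpy as np

    def projection_index(x, curve, lam):
        # projection index (3): parameter of the closest stored curve point,
        # taking the largest value in case of ties
        d = np.linalg.norm(x - curve, axis=1)
        ties = np.flatnonzero(d == d.min())
        return lam[ties[-1]]

    def self_consistency_gaps(X, curve, lam, nbins=20):
        # compare the average of the points projecting into each lambda-bin
        # with the curve; for a principal curve the gaps should be near zero
        proj = np.array([projection_index(x, curve, lam) for x in X])
        edges = np.linspace(lam.min(), lam.max(), nbins + 1)
        gaps = []
        for a, b in zip(edges[:-1], edges[1:]):
            in_bin = (proj >= a) & (proj < b)
            if in_bin.any():
                mid = np.argmin(np.abs(lam - 0.5 * (a + b)))
                gaps.append(np.linalg.norm(X[in_bin].mean(axis=0) - curve[mid]))
        return np.array(gaps)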
It is easy to check that for ellipsoidal distributions the principal components are principal curves. For a spherically symmetric distribution, any line through the mean vector is a principal curve. For any two-dimensional spherically symmetric distribution, a circle with center at the origin and radius E||X|| is a principal curve. (Strictly speaking, a circle does not fit our definition, because it does intersect itself. Nevertheless, see our note at the beginning of the Appendix, and Sec. 5.6, for more details.)

We show in the Appendix that for compact Λ it is always possible to construct densities with the carrier in a thin tube around f, which have f as a principal curve.

What about data generated from the model X = f(λ) + ε, with f smooth and E(ε) = 0? Is f a principal curve for this distribution? The answer generally seems to be no. We show in Section 6, in the more restrictive setting of data scattered around the arc of a circle, that the mean of the conditional distribution of x, given λ(x) = λ_0, lies outside the circle of curvature at λ_0; this implies that f cannot be a principal curve. So in this situation the principal curve is biased for the functional model. We have some evidence that this bias is small, and that it decreases to 0 as the variance of the errors gets small relative to the radius of curvature. We discuss this bias, as well as estimation bias (which fortunately appears to operate in the opposite direction), in Section 6.

3. CONNECTIONS BETWEEN PRINCIPAL CURVES AND PRINCIPAL COMPONENTS

In this section we establish some facts that make principal curves appear as a reasonable generalization of linear principal components.

Proposition 1. If a straight line l(λ) = u_0 + λv_0 is self-consistent, then it is a principal component.

Proof. The line has to pass through the origin, because

$$\mathbf{0} = E(\mathbf{X}) = E_{\lambda} E(\mathbf{X} \mid \lambda_{\mathbf{l}}(\mathbf{X}) = \lambda) = E_{\lambda}(\mathbf{u}_0 + \lambda \mathbf{v}_0) = \mathbf{u}_0 + E(\lambda)\,\mathbf{v}_0.$$

Therefore u_0 = 0 (recall that we assumed u_0 ⊥ v_0). It remains to show that v_0 is an eigenvector of Σ, the covariance of X. For a line through the origin the projection index is λ_l(X) = X'v_0, and so, using self-consistency,

$$\Sigma \mathbf{v}_0 = E(\mathbf{X}\mathbf{X}')\mathbf{v}_0 = E_{\lambda} E(\mathbf{X}\mathbf{X}'\mathbf{v}_0 \mid \mathbf{X}'\mathbf{v}_0 = \lambda) = E_{\lambda}[\lambda\, E(\mathbf{X} \mid \lambda_{\mathbf{l}}(\mathbf{X}) = \lambda)] = E(\lambda^2)\,\mathbf{v}_0.$$

Principal components need not be self-consistent in the sense of the definition; however, they are self-consistent with respect to linear regression.

Proposition 2. Suppose that l(λ) = u_0 + λv_0 is a straight line, and that we linearly regress the p components X_j of X on the projection λ_l(X), resulting in linear functions f_j(λ). Then f = l iff v_0 is an eigenvector of Σ and u_0 = 0.

The proof of this requires only elementary linear algebra and is omitted.

Figure 3. Each point on a principal curve is the average of the points that project there.

A Distance Property of Principal Curves

An important property of principal components is that they are critical points of the distance from the observations. Let d(x, f) denote the usual euclidean distance from a point x to its projection on f: d(x, f) = ||x − f(λ_f(x))||, and define D²(h, f) = E_h d²(X, f). Consider a straight line l(λ) = u + λv. The distance D²(h, l) in this case may be regarded as a function of u and v: D²(h, l) = D²(h, u, v). It is well known that grad_{u,v} D²(h, u, v) = 0 iff u = 0 and v is an eigenvector of Σ; that is, the line l is a principal-component line.

We now restate this fact in a variational setting and extend it to principal curves. Let G denote a class of curves parameterized over Λ. For g ∈ G define f_t = f + tg. This creates a perturbed version of f.

Definition 2. The curve f is called a critical point of the distance function for variations in the class G iff

$$\left.\frac{d D^2(h, \mathbf{f}_t)}{dt}\right|_{t=0} = 0 \quad \forall\, \mathbf{g} \in \mathcal{G}.$$

Proposition 3. Let G_l denote the class of straight lines g(λ) = a + λb.
A straight line f_0(λ) = a_0 + λb_0 is a critical point of the distance function for variations in G_l iff b_0 is an eigenvector of cov(X) and a_0 = 0.

The proof involves straightforward linear algebra and is omitted. A result analogous to Proposition 3 holds for principal curves.
Proposition 4. Let G_B denote the class of smooth (C^∞) curves parameterized over Λ, with ||g|| ≤ 1 and ||g'|| ≤ 1. Then f is a principal curve of h iff f is a critical point of the distance function for perturbations in G_B.

A proof of Proposition 4 is given in the Appendix. The condition that ||g|| is bounded guarantees that f_t lies in a thin tube around f and that the tubes shrink uniformly as t → 0. The boundedness of ||g'|| ensures that for t small enough, f_t' is well behaved and, in particular, bounded away from 0 for t < 1. Both conditions together guarantee that, for small enough t, λ_{f_t} is well defined.

4. AN ALGORITHM FOR FINDING PRINCIPAL CURVES

By analogy to linear principal-component analysis, we are particularly interested in finding smooth curves corresponding to local minima of the distance function. Our strategy is to start with a smooth curve, such as the largest linear principal component, and check if it is a principal curve. This involves projecting the data onto the curve and then evaluating their expectation conditional on where they project. Either this conditional expectation coincides with the curve, or we get a new curve as a by-product. We then check if the new curve is self-consistent, and so on. If the self-consistency condition is met, we have found a principal curve. It is easy to show that both of the operations of projection and conditional expectation reduce the expected distance from the points to the curve.

The Principal-Curve Algorithm

The previous discussion motivates the following iterative algorithm.

Initialization: Set f^(0)(λ) = x̄ + aλ, where a is the first linear principal component of h. Set λ^(0)(x) = λ_{f^(0)}(x).

Repeat: over iteration counter j:
1. Set f^(j)(·) = E(X | λ_{f^(j−1)}(X) = ·).
2. Define λ^(j)(x) = λ_{f^(j)}(x) for all x; transform λ^(j) so that f^(j) is unit speed.
3. Evaluate D²(h, f^(j)) = E_{λ^(j)} E[||X − f^(j)(λ^(j)(X))||² | λ^(j)(X)].

Until: the change in D²(h, f^(j)) is below some threshold.

There are potential problems with this algorithm. Although principal curves are by definition differentiable, there is no guarantee that the curves produced by the conditional-expectation step of the algorithm have this property. Discontinuities can certainly occur at the endpoints of a curve. The problem is illustrated in Figure 4, where the expected values of the observations projecting onto f(λ_min) and f(λ_max) are disjoint from the new curve. If this occurs, we have to join f(λ_min) and f(λ_max) to the rest of the curve in a differentiable fashion.

Figure 4. The mean of the observations projecting onto an endpoint of the curve can be disjoint from the rest of the curve.

In light of the previous discussion, we cannot prove that the algorithm converges. All we have is some evidence in its favor:

1. By definition, principal curves are fixed points of the algorithm.
2. Assuming that each iteration is well defined and produces a differentiable curve, we can show that the expected distance D²(h, f^(j)) converges.
3. If the conditional-expectation operation in the principal-curve algorithm is replaced by fitting a least squares straight line, then the procedure converges to the largest principal component.
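For a finite sample (anticipating Sec. 5), the loop can be sketched in Python as follows. This is our minimal rendering, not the authors' S implementation: `smooth` stands in for any scatterplot smoother applied coordinatewise, the projection step here uses only the stored vertices, and the stopping rule mirrors the relative-distance criterion described below.

    import numpy as np

    def fit_principal_curve(X, smooth, max_iter=20, tol=1e-3):
        X = X - X.mean(axis=0)
        a = np.linalg.svd(X, full_matrices=False)[2][0]       # first PC direction
        lam = X @ a                                           # initial projections
        d_old = np.inf
        for _ in range(max_iter):
            order = np.argsort(lam)
            curve = smooth(lam[order], X[order])              # expectation step
            seg = np.linalg.norm(np.diff(curve, axis=0), axis=1)
            arclen = np.concatenate([[0.0], np.cumsum(seg)])  # unit-speed parameter
            D = np.linalg.norm(X[:, None, :] - curve[None, :, :], axis=2)
            lam = arclen[D.argmin(axis=1)]                    # projection step (vertices)
            d_new = (D.min(axis=1) ** 2).mean()
            if d_old - d_new < tol * d_old:                   # relative change small?
                break
            d_old = d_new
        return curve, lam

    def running_mean(lam, X, k=15):
        # crude fixed-width local average, a stand-in for a real smoother
        return np.column_stack([np.convolve(X[:, j], np.ones(k) / k, mode="same")
                                for j in range(X.shape[1])])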
Al- We iterate until the relative change in the distance D2(h, though principal curves are by definition differentiable, fO-D) _ DifU))/D2(h, ff"1)) is below some threshold there is no guarantee that the curves produced by the (we use .001). The distance is estimated in the obvious conditional-expectation step of the algorithm have this way, adding up the squared distances of the points in the property. Discontinuities can certainly occur at the end- sample to their closest points on the current curve. We points of a curve. The problem is illustrated in Figure 4, are unable to prove that the algorithm converges, or that where the expected values of the observations projecting each step guarantees a decrease in the criterion. In prac- onto f(Amin) and f(Amax) are disjoint from the new curve. tice, we have had no convergence problems with more If this occurs, we have to join f(Amin) and f(Amax) to the than 40 real and simulated examples. rest of the curve in a differentiable fashion. In light of the previous discussion, we cannot prove that the algorithm 5.1 The Projection Step converges. All we have is some evidence in its favor: For fixed f(y)(") we wish to find for each x, in the sample 1. By definition, principal curves are fixed points of the the value A, = AfO^x,-). algorithm. Define dik as the distance between x( and its closest point 2. Assuming that each iteration is well defined and pro- on the line segment joining each pair (f(;)(A^), f(y)(^i+i)). duces a differentiable curve, we can show that the expected Corresponding to each dik is a value Xik E [AJ^, A^J. We distance D2(h, f (;) ) converges. then set At to the Xik corresponding to the smallest value
5.1 The Projection Step

For fixed f^(j)(·) we wish to find for each x_i in the sample the value λ_i = λ_{f^(j)}(x_i). Define d_{ik} as the distance between x_i and its closest point on the line segment joining each pair (f^(j)(λ_k), f^(j)(λ_{k+1})). Corresponding to each d_{ik} is a value λ_{ik} ∈ [λ_k, λ_{k+1}]. We then set λ_i to the λ_{ik} corresponding to the smallest value of d_{ik}:

$$\lambda_i = \lambda_{ik^*} \quad \text{if } d_{ik^*} = \min_k d_{ik}. \qquad (4)$$

Corresponding to each λ_i is an interpolated f_i^(j); using these values to represent the curve, we replace λ_i by the arc length from f_1^(j) to f_i^(j).

5.2 The Conditional-Expectation Step: Scatterplot Smoothing

The goal of this step is to estimate f^(j+1)(λ) = E(X | λ^(j) = λ). We restrict ourselves to estimating this quantity at the n values λ_1, . . . , λ_n found in the projection step. A natural way of estimating E(X | λ^(j) = λ_i) would be to gather all of the observations that project onto f^(j) at λ_i and find their mean. Unfortunately, there is generally only one such observation, x_i. It is at this stage that we introduce the scatterplot smoother, a fundamental building block in the principal-curve procedure for finite data sets. We estimate the conditional expectation at λ_i by averaging all of the observations x_k in the sample for which λ_k is close to λ_i. As long as these observations are close enough and the underlying conditional expectation is smooth, the bias introduced in approximating the conditional expectation is small. On the other hand, the variance of the estimate decreases as we include more observations in the neighborhood.

Scatterplot Smoothing. Local averaging is not a new idea. In the more common regression context, scatterplot smoothers are used to estimate the regression function E(Y | x) by local averaging. Some commonly used smoothers are kernel smoothers (e.g., Watson 1964), spline smoothers (Silverman 1985; Wahba and Wold 1975), and the locally weighted running-line smoother of Cleveland (1979). All of these smooth a one-dimensional response against a covariate. In our case, the variable to be smoothed is p-dimensional, so we simply smooth each coordinate separately.
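In code, the projection step (4) amounts to projecting x_i onto every segment of the current polygon and keeping the closest foot point. The following vectorized sketch is ours (the function name is hypothetical); it returns the arc-length value of the closest point:

    import numpy as np

    def project_to_polygon(x, vertices, arclen):
        # distances from x to each segment (f(lam_k), f(lam_{k+1}))
        v0, v1 = vertices[:-1], vertices[1:]
        seg = v1 - v0
        L2 = (seg ** 2).sum(axis=1)
        # fractional position of the foot of the perpendicular, clipped to the segment
        t = np.clip(((x - v0) * seg).sum(axis=1) / np.where(L2 > 0, L2, 1.0), 0.0, 1.0)
        foot = v0 + t[:, None] * seg
        d = np.linalg.norm(x - foot, axis=1)
        k = d.argmin()                        # smallest d_ik, as in (4)
        return arclen[k] + t[k] * np.sqrt(L2[k])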
Our current implementation of the algorithm is an S function (Becker, Chambers, and Wilks 1988) that allows any scatterplot smoother to be used. We have experience with all of those previously mentioned, although all of the examples were fitted using locally weighted running lines. We give a brief description; for details see Cleveland (1979).

Locally Weighted Running-Lines Smoother. Consider the estimation of E(x | λ), that is, a single coordinate function, based on a sample of pairs (λ_1, x_1), . . . , (λ_n, x_n), and assume the λ_i are ordered. To estimate E(x | λ_i), the smoother fits a straight line to the wn observations closest in λ to λ_i. The estimate is taken to be the fitted value of the line at λ_i. The fraction w of points in the neighborhood is called the span. In fitting the line, weighted least squares regression is used. The weights are derived from a symmetric kernel centered at λ_i that dies smoothly to 0 within the neighborhood. Specifically, if Δ_i is the distance to the wnth nearest neighbor, then the points λ_j in the neighborhood get the tricube weights w_j = (1 − |(λ_j − λ_i)/Δ_i|³)³.
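A minimal version of this smoother can be written as follows. This is a sketch of Cleveland's idea under our own naming, not his robust implementation (the iterated robustness weights are omitted):

    import numpy as np

    def running_line_fit(lam, x, i, span=0.5):
        # fit a weighted least squares line to the wn points nearest lam[i]
        n = len(lam)
        k = max(2, int(span * n))
        dist = np.abs(lam - lam[i])
        idx = np.argsort(dist)[:k]
        delta = dist[idx].max()                              # distance to wn-th neighbor
        w = (1 - (dist[idx] / max(delta, 1e-12)) ** 3) ** 3  # tricube weights
        A = np.column_stack([np.ones(k), lam[idx]])
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(A * sw[:, None], x[idx] * sw, rcond=None)[0]
        return beta[0] + beta[1] * lam[i]                    # fitted value at lam[i]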
5.3 A Demonstration of the Algorithm

To illustrate the principal-curve procedure, we generated a set of 100 data points from a circle in two dimensions with independent Gaussian errors in both coordinates:

$$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 5\sin(\lambda) \\ 5\cos(\lambda) \end{pmatrix} + \begin{pmatrix} e_1 \\ e_2 \end{pmatrix}, \qquad (5)$$

where λ is uniformly distributed on [0, 2π) and e_1 and e_2 are independent N(0, 1).

Figure 5 shows the data, the circle (dashed line), and the estimated curve (solid line) for selected steps of the iteration. The starting curve is the first principal component (Fig. 5a). Any line through the origin is a principal curve for the population model (5), but this is not generally the case for data. Here the algorithm converges to an estimate for another population principal curve, the circle. This example is admittedly artificial, but it presents the principal-curve procedure with a particularly tough job: the starting guess is wholly inappropriate, and the projection of the points onto this line does not nearly represent the final ordering of the points when projected onto the solution curve.

Points project in a certain order on the starting vector (as depicted in Fig. 6). The new curve is a function of λ^(0), obtained by averaging the coordinates of points close in λ^(0). The new λ^(1) values are found by projecting the points onto the new curve. It can be seen that the ordering of the projected points along the new curve can be very different from the ordering along the previous curve. This enables the successive curves to bend to shapes that could not be parameterized as a function of the linear principal component.

Figure 5. Selected Iterates of the Principal-Curve Procedure for the Circle Data. In all of the figures we see the data, the circle from which the data are generated, and the current estimate produced by the algorithm: (a) the starting curve is the principal-component line, with average squared distance D²(f^(0)) = 12.91; (b) iteration 2: D²(f^(2)) = 10.43; (c) iteration 4: D²(f^(4)) = 2.58; (d) final iteration 8: D²(f^(8)) = 1.55.

Figure 6. Schematics Emphasizing the Iterative Nature of the Algorithm. The curve of the first iteration is a function of λ^(0) measured along the starting vector (a). The curve of the second iteration is a function of λ^(1) measured along the curve of the first iteration (b).

5.4 Span Selection for the Scatterplot Smoother

The crucial parameter of any local averaging smoother is the size of the neighborhood over which averaging takes place. We discuss the choice of the span w for the locally weighted running-line smoother.

A Fixed-Span Strategy. The common first guess for f is a straight line. In many interesting situations, the final curve is not a function of the arc length of this initial curve (see Fig. 6); it is reached by successively bending the original curve. We have found that if the initial span of the smoother is too small, the curve may bend too fast and follow the data too closely. Our most successful strategy has been to initially use a large span, and then to decrease it gradually. In particular, we start with a span of .6n observations in each neighborhood, and let the algorithm converge (according to the criterion outlined previously). We then drop the span to .5n and iterate till convergence. Finally, the same is done at .4n, by which time the procedure has found the general shape of the curve. The curves in Figure 5 were found using this strategy.

Spans of this magnitude have frequently been found appropriate for scatterplot smoothing in the regression context. In some applications, especially the two-dimensional ones, we can plot the curve and the points and select a span that seems appropriate for the data. Other applications, such as the collider-ring example in Section 7.1, have a natural criterion for selecting the span.
Automatic Span Selection by Cross-Validation. Assume the procedure has converged to a self-consistent (with respect to the smoother) curve for the span last used. We do not want the fitted curve to be too wiggly relative to the density of the data. As we reduce the span, the average distance decreases and the curve follows the data more closely. The human eye is skilled at making trade-offs between smoothness and fidelity to the data; we would like a procedure that makes this judgment automatically.

A similar situation arises in nonparametric regression, where we have a response y and a covariate x. One rationale for making the smoothness judgment automatically is to ensure that the fitted function of x does a good job in predicting future responses. Cross-validation (Stone 1974) is an approximate method for achieving this goal, and proceeds as follows. We predict each response y_i in the sample using a smooth estimated from the sample with the ith observation omitted; let ŷ_(i) be this predicted value, and define the cross-validated residual sum of squares as

$$\text{CVRSS} = \sum_{i=1}^{n} (y_i - \hat{y}_{(i)})^2.$$

CVRSS/n is an approximately unbiased estimate of the expected squared prediction error. If the span is too large, the curve will miss features in the data, and the bias component of the prediction error will dominate. If the span is too small, the curve begins to fit the noise in the data, and the variance component of the prediction error will increase. We pick the span that corresponds to the minimum CVRSS.
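The following sketch makes the recipe concrete. It is our code, with hypothetical names; `smooth_at` is a simple tricube-weighted local mean standing in for whichever smoother is actually used:

    import numpy as np

    def smooth_at(lam, x, lam0, span):
        # stand-in pointwise smoother: tricube-weighted local mean
        k = max(2, int(span * len(lam)))
        d = np.abs(lam - lam0)
        idx = np.argsort(d)[:k]
        w = (1 - (d[idx] / max(d[idx].max(), 1e-12)) ** 3) ** 3 + 1e-9
        return np.sum(w * x[idx]) / np.sum(w)

    def pick_span_by_cv(lam, x, spans=(0.3, 0.4, 0.5, 0.6)):
        # leave-one-out cross-validation: smallest CVRSS wins
        n = len(lam)
        cvrss = []
        for w in spans:
            press = 0.0
            for i in range(n):
                keep = np.arange(n) != i
                press += (x[i] - smooth_at(lam[keep], x[keep], lam[i], w)) ** 2
            cvrss.append(press)
        return spans[int(np.argmin(cvrss))]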
In the principal-curve algorithm, we can use the same procedure for estimating the spans for each coordinate function separately, as a final smoothing step. Since most smoothers have this feature built in as an option, cross-validation in this manner is trivial to implement. Figure 7a shows the final curve after one more smoothing step, using cross-validation to select the span; nothing much has changed.

On the other hand, Figure 7b shows what happens if we continue iterating with the cross-validated smoothers. The spans get successively smaller, until the curve almost interpolates the data. In some situations, such as the Stanford linear collider example in Section 7.1, this may be exactly what we want. It is unlikely, however, that in this event cross-validation would be used to pick the span. A possible explanation for this behavior is that the errors in the coordinate functions are autocorrelated; cross-validation in this situation tends to pick spans that are too small (Hart and Wehrly 1986).

Figure 7. (a) The Final Curve of Figure 5 With One More Smoothing Step, Using Cross-Validation Separately for Each of the Coordinates, D²(f) = 1.28. (b) The Curve Obtained by Continuing the Iterations, Using Cross-Validation at Every Step.

5.5 Principal Curves and Splines

Our algorithm for estimating principal curves from samples is motivated by the algorithm for finding principal curves of densities, which in turn is motivated by the definition of principal curves. This is analogous to the motivation for kernel smoothers and locally weighted running-line smoothers: they estimate a conditional expectation, a population quantity that minimizes a population criterion. They do not minimize a data-dependent criterion.

On the other hand, smoothing splines do minimize data-dependent criteria. The cubic smoothing spline for a set of n pairs (λ_1, x_1), . . . , (λ_n, x_n) and penalty (smoothing parameter) μ minimizes

$$D^2(f) = \sum_{i=1}^{n} (x_i - f(\lambda_i))^2 + \mu \int (f''(\lambda))^2\, d\lambda \qquad (6)$$

among all functions f with f' absolutely continuous and f'' ∈ L² (e.g., see Silverman 1985).
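As one concrete stand-in for a cubic spline smoother, scipy's UnivariateSpline can be used to smooth each coordinate against λ. Note that it is tuned by a residual budget s rather than by the penalty μ in (6); the roles are analogous but not identical, so the following is only a sketch under that assumption:

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    rng = np.random.default_rng(1)
    lam = np.sort(rng.uniform(0.0, 1.0, 100))
    X = np.column_stack([np.sin(2 * np.pi * lam), np.cos(2 * np.pi * lam)])
    X = X + rng.normal(scale=0.1, size=X.shape)

    # one cubic smoothing spline per coordinate function
    curve = np.column_stack([UnivariateSpline(lam, X[:, j], k=3, s=1.0)(lam)
                             for j in range(X.shape[1])])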
We suggest the following criterion for defining principal curves in this context: Find f(λ) and λ_i ∈ [0, 1] (i = 1, . . . , n) so that

$$D^2(\mathbf{f}, \boldsymbol{\lambda}) = \sum_{i=1}^{n} \|\mathbf{x}_i - \mathbf{f}(\lambda_i)\|^2 + \mu \int_0^1 \|\mathbf{f}''(\lambda)\|^2\, d\lambda \qquad (7)$$

is minimized over all f with coordinate functions f_j ∈ S²[0, 1]. Notice that we have confined the functions to the unit interval and thus do not use the unit-speed parameterization. Intuitively, for a fixed smoothing parameter, functions defined over an arbitrarily large interval can satisfy the second-derivative smoothness criterion and visit every point. It is easy to make this argument rigorous.

We now apply our alternating algorithm to these criteria:

1. Given f, minimizing D²(f, λ) over λ only involves the first part of (7) and is our usual projection step. The λ_i are rescaled to lie in [0, 1].
2. Given λ, (7) splits up into p expressions of the form (6), one for each coordinate function. These are optimized by smoothing the p coordinates against λ, using a cubic spline smoother with parameter μ.

The usual penalized least squares arguments show that if a minimum exists, it must be a cubic spline in each coordinate. We make no claims about its existence, or about global convergence properties of this algorithm.

An advantage of the spline-smoothing algorithm is that it can be computed in O(n) operations, and thus is a strong competitor for the kernel-type smoothers, which take O(n²) unless approximations are used. Although it is difficult to guess the smoothing parameter μ, alternative methods, such as using the approximate degrees of freedom (see Cleveland 1979), are available for assessing the amount of smoothing and thus selecting the parameter.

Our current implementation of the algorithm allows a choice of smoothing splines or locally weighted running lines, and we have found it difficult to distinguish their performance in practice.

5.6 Further Illustrations and Discussion of the Algorithm

The procedure worked well on the circle example and several other artificial examples. Nevertheless, sometimes its behavior is surprising, at least at first glance. Consider a data set from a spherically symmetric unimodal distribution centered at the origin. A circle with radius E||x|| is a principal curve, as are all straight lines passing through the origin. The circle, however, has smaller expected squared distance from the observations than the lines.

The 150 points in Figure 8 were sampled independently from a bivariate spherical Gaussian distribution. When the principal-curve procedure is started from the circle, it does not move much, except at the endpoints (as depicted in Fig. 8a). This is a consequence of the smoother's endpoint behavior, in that it is not constrained to be periodic. Figure 8b shows what happens when we use a periodic version of the smoother and also start at a circle. Nevertheless, starting from the linear principal component (where theoretically it should stay), and using the nonperiodic smoother, the algorithm iterates to a curve that, apart from the endpoints, appears to be attempting to model the circle. (See Fig. 8c; this behavior occurred repeatedly over several simulations of this example. The ends of the curve are stuck, and further iterations do not free them.)

The example illustrates the fact that the algorithm tends to find curves that are minima of the distance function. This is not surprising; after all, the principal-curve algorithm is a generalization of the power method for finding eigenvectors, which exhibits exactly the same behavior. The power method tends to converge to an eigenvector for the largest eigenvalue, unless special precautions are taken.

Interestingly, the algorithm using the periodic smoother and starting from the linear principal component finds a circle identical to that in Figure 8b.
Figure 8. Some Curves Produced by the Algorithm Applied to Bivariate Spherical Gaussian Data: (a) the curve found when the algorithm is started at a circle centered at the mean; (b) the circle found starting with either a circle or a line but using a periodic smoother; (c) the curve found using the regular smoother, but starting at a line. A periodic smoother ensures that the curve found is closed.

6. BIAS CONSIDERATIONS: MODEL AND ESTIMATION BIAS

Model bias occurs when the data are of the form x = f(λ) + e and we wish to recover f(λ). In general, if f(λ) has curvature, it is not a principal curve for the distribution it generates. As a consequence, the principal-curve procedure can only find a biased version of f(λ), even if it starts at the generating curve. This bias goes to 0 with the ratio of the noise variance to the radius of curvature.

Estimation bias occurs because we use scatterplot smoothers to estimate conditional expectations. The bias is introduced by averaging over neighborhoods, which usually has a flattening effect. We demonstrate this bias with a simple example.

A Simple Model for Investigating Bias

Suppose that the curve f is an arc of a circle centered at the origin and with radius ρ, and the data x are generated from a bivariate Gaussian, with mean chosen uniformly on the arc and variance σ²I. Figure 9 depicts the situation. Intuitively, it seems that more mass is put outside the circle than inside, so the circle closest to the data should have radius larger than ρ. Consider the points that project onto a small arc A_θ(λ) of the circle with angle θ centered at λ, as depicted in the figure. As we shrink this arc down to a point, the segment shrinks down to the normal to the curve at that point, but there is always more mass outside the circle than inside. This implies that the conditional expectation lies outside the circle.

We can prove (Hastie 1984) that E(x | λ_f(x) ∈ A_θ(λ)) = (r_θ/ρ) f(λ), where

$$r_\theta = r^* \, \frac{\sin(\theta/2)}{\theta/2} \qquad (8)$$

and r* = E[(ρ + e_1)² + e_2²]^{1/2} ≈ ρ + σ²/(2ρ). Finally, r* → ρ as σ/ρ → 0.

Equation (8) nicely separates the two components of bias. Even if we had infinitely many observations, and thus would not need local averaging to estimate conditional expectation, the circle with radius ρ would not be a stationary point of the algorithm; the principal curve is a circle with radius r* > ρ. The factor sin(θ/2)/(θ/2) is attributable to local averaging. There is clearly an optimal span at which the two bias components cancel exactly. In practice, this is not much help, since knowledge of the radius of curvature and the error variance is needed to determine it. Typically, these quantities will change as we move along the curve. Hastie (1984) gives a demonstration that these bias patterns persist in a situation where the curvature changes along the curve.

Figure 9. The data are generated from the arc of a circle with radius ρ and with iid N(0, σ²I) errors. The location on the circle is selected uniformly. The best fitting circle (dashed) has a radius larger than that of the generating curve.
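The size of the model bias is easy to check by simulation. The short sketch below is ours (ρ and σ are chosen arbitrarily); it estimates r* = E||x|| for data scattered around a full circle and compares it with the approximation ρ + σ²/(2ρ):

    import numpy as np

    rng = np.random.default_rng(2)
    rho, sigma, n = 5.0, 1.0, 200_000

    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    x = rho * np.column_stack([np.cos(theta), np.sin(theta)])
    x = x + rng.normal(scale=sigma, size=(n, 2))

    r_star = np.linalg.norm(x, axis=1).mean()
    print(r_star, rho + sigma ** 2 / (2.0 * rho))   # both should be close to 5.1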
7. EXAMPLES

This section contains two examples that illustrate the use of the procedure.

7.1 The Stanford Linear Collider Project

This application of principal curves was implemented by a group of geodetic engineers at the Stanford Linear Accelerator Center (SLAC) in California. They used software developed in consultation with the first author and Jerome Friedman of SLAC.

The Stanford linear collider (SLC) collides two intense and finely focused particle beams. Details of the collision are recorded in a collision chamber and studied by particle physicists, whose major goal is to discover new subatomic particles. Since there is only one linear accelerator at SLAC, it is used to accelerate a positron and an electron bunch in a single pulse, and the collider arcs bend these beams to bring them to collision (see Fig. 10).

Each of the two collider arcs contains roughly 475 magnets (23 segments of 20, plus some extras), which guide the positron and electron beams. Ideally, these magnets lie on a smooth curve with a circumference of about 3 kilometers (km) (as depicted in the schematic). The collider has a third dimension, and actually resembles a floppy tennis racket, because the tunnel containing the magnets goes underground (whereas the accelerator is aboveground).

Measurement errors were inevitable in the procedure used to place the magnets. This resulted in the magnets lying close to the planned curve, but with errors in the range of ±1.0 millimeters (mm). A consequence of these errors was that the beam could not be adequately focused. The engineers realized that it was not necessary to move the magnets to the ideal curve, but rather to a curve through the existing magnet positions that was smooth enough to allow focused bending of the beam. This strategy would theoretically reduce the amount of magnet movement necessary. The principal-curve procedure was used to find this curve. The remainder of this section describes some special features of this simple but important application.

Initial attempts at fitting curves used the data in the measured three-dimensional geodetic coordinates, but it was found that the magnet displacements were small relative to the bias induced by smoothing. The theoretical arc was then removed, and subsequent curve fitting was based on the residuals. This was achieved by replacing the three coordinates of each magnet with three new coordinates: (a) the arc length from the beginning of the arc to the point of projection onto the ideal curve (x), (b) the distance from the magnet to this projection in the horizontal plane (y), and (c) the distance in the vertical plane (z).

This technique effectively removed the major component of the bias and is an illustration of how special situations lend themselves to adaptations of the basic procedure. Of course, knowledge of the ideal curve is not usually available in other applications.

There is a natural way of choosing the smoothing parameter in this application. The fitted curve, once transformed back to the original coordinates, can be represented by a polygon with a vertex at each magnet. The angle between these segments is of vital importance, since the further it is from 180°, the harder it is to launch the particle beams into the next segment without hitting the wall of the beam pipe [diameter 1 centimeter (cm)]. In fact, if θ_i measures the departure of this angle from 180°, the operating characteristics of the magnets specify a threshold θ_max of .1 milliradian. Now, no smoothing results in no magnet movement (no work), but with many magnets violating the threshold.

Figure 10. A Rough Schematic of the Stanford Linear Accelerator and the Linear Collider Ring.
As the amount of smoothing (span) is increased, the angles tend to decrease, and the residuals, and thus the amounts of magnet movement, increase. The strategy was to increase the span until no magnets violated the angle constraint. Figure 11 gives the fitted vertical and horizontal components of the chosen curve, for a section of the north arc consisting of 149 magnets. This relatively rough curve was then translated back to the original coordinates, and the appropriate adjustments for each magnet were determined. The systematic trend in these coordinate functions represents systematic departures of the magnets from the theoretical curve. Only 66% of the magnets needed to be moved, since the remaining 34% of the residuals were below 60 μm in length and thus considered negligible.

There are some natural constraints on the system. Some of the magnets were fixed by design and thus could not be moved. The beam enters the arc parallel to the accelerator, so the initial magnets do no bending. Similarly, there are junction points at which no bending is allowed. These constraints are accommodated by attaching weights to the points representing the magnets and using a weighted version of the smoother in the algorithm. By giving the fixed magnets sufficiently large weights, the constraints are met. Figure 11 has the parallel constraints built in at the endpoints.

Finally, since some of the magnets were way off target, we used a resistant version of the fitting procedure. Points are weighted according to their distance from the fitted curve, and deviations beyond a fixed threshold are given weight 0.
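The angle criterion is straightforward to evaluate for a fitted polygon. In the sketch below (our rendering; the threshold value is the one quoted above), θ_i is the angle between successive segment directions, that is, the departure of the interior angle from 180°:

    import numpy as np

    def bend_angles(vertices):
        # theta_i: departure of the angle at each interior vertex from 180 degrees
        seg = np.diff(vertices, axis=0)
        u = seg / np.linalg.norm(seg, axis=1, keepdims=True)
        cos = np.clip((u[:-1] * u[1:]).sum(axis=1), -1.0, 1.0)
        return np.arccos(cos)              # radians; 0 means perfectly straight

    theta_max = 1e-4                       # .1 milliradian threshold
    # strategy: increase the span until bend_angles(curve).max() <= theta_max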
Figure 11. The Fitted Coordinate Functions for the Magnet Positions for a Section of the Stanford Linear Collider. The data represent residuals from the theoretical curve. Some (34%) of the deviations from the fitted curve were small enough that these magnets were not moved.

7.2 Gold Assay Pairs

A California-based company collects computer-chip waste to sell it for its content of gold and other precious metals. Before bidding for a particular cargo, the company takes a sample to estimate the gold content of the whole lot. The sample is split in two. One subsample is assayed by an outside laboratory, and the other by their own in-house laboratory. The company eventually wishes to use only one of the assays. It is in their interest to know which laboratory produces on average lower gold-content assays for a given sample.

The data in Figure 12 consist of 250 pairs of gold assays. Each point represents an observation x_i, with x_{ji} = log(1 + assay yield) for the ith assay pair for lab j, where j = 1 corresponds to the outside lab and j = 2 to the in-house lab. The log transformation stabilizes the variance and produces a more even scatter of points than the untransformed data. [There were many more small assays (<1 ounce (oz) per ton) than larger ones (>10 oz per ton).]

Figure 12. (a) Plot of the Log Assays for the In-House and Outside Labs. The solid curve is the principal curve, the dotted curve the scatterplot smooth, and the dashed curve the 45° line. (b) A Band of 25 Bootstrap Curves. Each curve is the principal curve of a bootstrap sample. A bootstrap sample is obtained by randomly assigning errors to the principal curve for the original data (solid curve). The band of curves appears to be centered at the solid curve, indicating small bias. The spread of the curves gives an indication of variance. (c) Another Band of 25 Bootstrap Curves. Each curve is the principal curve of a bootstrap sample, based on the linear errors-in-variables regression line (solid line). This simulation tests the null hypothesis of no kink. There is evidence that the kink is real, since the principal curve (solid curve) lies outside this band in the region of the kink.

Our model for these data is

$$\begin{pmatrix} x_{1i} \\ x_{2i} \end{pmatrix} = \begin{pmatrix} f(\tau_i) \\ \tau_i \end{pmatrix} + \begin{pmatrix} e_{1i} \\ e_{2i} \end{pmatrix}, \qquad (9)$$

where τ_i is the expected gold content for sample i using the in-house lab assay, f(τ_i) is the expected assay result for the outside lab relative to the in-house lab, and e_{ji} is measurement error, assumed iid with var(e_{1i}) = var(e_{2i}) ∀ i.

This is a generalization of the linear errors-in-variables model, the structural model (if we regard the τ_i themselves as unobservable random variables), or the functional model (if the τ_i are considered fixed):

$$\begin{pmatrix} x_{1i} \\ x_{2i} \end{pmatrix} = \begin{pmatrix} \mu + \beta\tau_i \\ \tau_i \end{pmatrix} + \begin{pmatrix} e_{1i} \\ e_{2i} \end{pmatrix}. \qquad (10)$$

Model (10) essentially looks for deviations from the 45° line, and is estimated by the first principal component. Model (9) is a special case of the principal-curve model,
where one of the coordinate functions is the identity. This identifies the systematic component of variable x_2 with the arc-length parameter. Similarly, we estimate (9) using a natural variant of the principal-curve algorithm: in the smoothing step we smooth only x_1 against the current values of τ, and then update τ by projecting the data onto the curve defined by (f(τ), τ).

The dotted curve in Figure 12 is the usual scatterplot smooth of x_1 against x_2, and is clearly misleading as a scatterplot summary. The principal curve lies above the 45° line in the interval 1.4-4, which represents an untransformed assay content interval of 3-15 oz/ton. In this interval the in-house assay tends to be lower than that of the outside lab. The difference is reversed at lower levels, but this is of less practical importance, since at these levels the lot is less valuable.

A natural question arising at this point is whether the bend in the curve is real, or whether the linear model (10) is adequate. If we had access to more data from the same population, we could simply calculate the principal curves for the additional samples and see for how many of them the bend appeared.

In the absence of such additional samples, we use the bootstrap (Efron 1981, 1982) to simulate them. We compute the residual vectors of the observed data from the fitted curve in Figure 12a, and treating them as iid, we pool all 250 of them. Since these residuals are derived from a projection essentially onto a straight line, their expected squared length is half that of the residuals in model (9). We therefore scale them up by a factor of √2. We then sampled with replacement from this pool, and reconstructed a bootstrap replicate by adding a sampled residual vector to each of the fitted values of the original fit. For each of these bootstrapped data sets the entire curve-fitting procedure was applied and the fitted curves were saved. This method of bootstrapping is aimed at exposing both bias and variance.
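In outline, the resampling can be coded as below. This is our sketch with hypothetical names; `fit_curve` stands for a rerun of the entire principal-curve procedure on the bootstrap data:

    import numpy as np

    def bootstrap_curves(fitted, resid, fit_curve, B=25, seed=None):
        # pool the residual vectors, scale by sqrt(2) to restore their
        # expected squared length, and refit to each reconstructed sample
        rng = np.random.default_rng(seed)
        n = len(fitted)
        curves = []
        for _ in range(B):
            idx = rng.integers(0, n, size=n)
            boot = fitted + np.sqrt(2.0) * resid[idx]
            curves.append(fit_curve(boot))
        return curves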
Figure 12b shows the errors-in-variables principal curves obtained for 25 bootstrap samples. The spreads of these curves give an idea of the variance of the fitted curve. The difference between their average and the original fit estimates the bias, which in this case is negligible.

Figure 12c shows the result of a different bootstrap experiment. Our null hypothesis is that the relationship is linear, and thus we sampled in the same way as before, but we replaced the principal curve with the linear errors-in-variables line. The observed curve (thick solid curve) lies outside the band of curves fitted to 25 bootstrapped data sets, providing additional evidence that the bend is indeed real.

8. EXTENSION TO HIGHER DIMENSIONS: PRINCIPAL SURFACES

We have had some success in extending the definitions and algorithms for curves to two-dimensional (globally parameterized) surfaces.

A continuous two-dimensional globally parameterized surface in R^p is a function f: Λ → R^p for Λ ⊂ R², where f is a vector of continuous functions:

$$\mathbf{f}(\boldsymbol{\lambda}) = \begin{pmatrix} f_1(\lambda_1, \lambda_2) \\ \vdots \\ f_p(\lambda_1, \lambda_2) \end{pmatrix}.$$

Let X be defined as before, and let f denote a smooth two-dimensional surface in R^p, parameterized over Λ ⊂ R². Here the projection index λ_f(x) is defined to be the parameter value corresponding to the point on the surface closest to x.

The principal surfaces of h are those members of this class of surfaces that are self-consistent: E(X | λ_f(X) = λ) = f(λ) for a.e. λ. Figure 13 illustrates the situation. We do not yet have a rigorous justification for these definitions, although we have had success in implementing an algorithm.

The principal-surface algorithm is similar to the curve algorithm; two-dimensional surface smoothers are used instead of one-dimensional scatterplot smoothers. See Hastie (1984) for more details of principal surfaces, the algorithm to compute them, and examples.

Figure 13. Each point on a principal surface is the average of the points that project there.
9. DISCUSSION

Ours is not the first attempt at finding a method for fitting nonlinear manifolds to multivariate data. In discussing other approaches to the problem we restrict ourselves to one-dimensional manifolds (the case treated in this article).

The approach closest in spirit to ours was suggested by Carroll (1969). He fit a model of the form x_i = p(λ_i) + e_i, where p(λ) is a vector of polynomials p_j(λ) = Σ_k a_{jk} λ^k of prespecified degrees K_j. The goal is to find the coefficients of the polynomials and the λ_i (i = 1, . . . , n) minimizing the loss function Σ_i ||e_i||². The algorithm makes use of the fact that for given λ_1, . . . , λ_n, the optimal polynomial coefficients can be found by linear least squares, and the loss function thus can be written as a function of the λ_i only. Carroll gave an explicit formula for the gradient of the loss function, which is helpful in the n-dimensional numerical optimization required to find the optimal λ's.

The model of Etezadi-Amoli and McDonald (1983) is the same as Carroll's, but they used different goodness-of-fit measures. Their goal was to minimize the off-diagonal elements of the error covariance matrix Σ = E′E, which is in the spirit of classical linear factor analysis. Various measures for the cumulative size of the off-diagonal elements are suggested, such as the sum of their squares. Their algorithm is similar to ours in that it alternates between improving the λ's for given polynomial coefficients and finding the optimal polynomial coefficients for given λ's. The latter is a linear least squares problem, whereas the former constitutes one step of a nonlinear optimization in n parameters.

Shepard and Carroll (1966) proceeded from the assumption that the p-dimensional observation vectors lie exactly on a smooth one-dimensional manifold. In this case, it is possible to find parameter values λ_1, . . . , λ_n such that for each one of the p coordinates, x_{ij} varies smoothly with λ_i. The basis of their method is a measure for the degree of smoothness of the dependence of x_{ij} on λ_i. This measure of smoothness, summed over the p coordinates, is then optimized with respect to the λ's: one finds those values of λ_1, . . . , λ_n that make the dependence of the coordinates on the λ's as smooth as possible. We do not go into the definition and motivation of the smoothness measure; it is quite subtle, and we refer the interested reader to the original source. We just wish to point out that instead of optimizing smoothness, one could optimize a combination of smoothness and fidelity to the data as described in Section 5.5, which would lead to modeling the coordinate functions as spline functions and should allow the method to deal with noise in the data better.

In view of this previous work, what do we think is the contribution of the present article?

• From the operational point of view it is advantageous that there is no need to specify a parametric form for the coordinate functions. Because the curve is represented as a polygon, finding the optimal λ's for given coordinate functions is easy. This makes the alternating minimization attractive and allows fitting of principal curves to large data sets.

• From the theoretical point of view, the definition of principal curves as conditional expectations agrees with our mental image of a summary. The characterization of principal curves as critical points of the expected squared distance from the data makes them appear as a natural generalization of linear principal components. This close connection is further emphasized by the fact that linear principal curves are principal components, and that the algorithm converges to the largest principal component if conditional expectations are replaced by least squares straight lines.
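For comparison with our procedure, a Carroll-style fit is easy to sketch: for fixed λ the coefficients are an ordinary least squares problem, and in the code below (ours; a crude stand-in for the gradient-based optimization Carroll describes) the λ_i are updated by projecting onto a fine grid of the fitted polynomial curve:

    import numpy as np

    def polynomial_factor_fit(X, degree=3, n_iter=50):
        Xc = X - X.mean(axis=0)
        lam = Xc @ np.linalg.svd(Xc, full_matrices=False)[2][0]   # initialize
        for _ in range(n_iter):
            A = np.vander(lam, degree + 1)     # fixed lambda: linear least squares
            coef = np.linalg.lstsq(A, X, rcond=None)[0]
            grid = np.linspace(lam.min(), lam.max(), 200)
            curve = np.vander(grid, degree + 1) @ coef
            D = ((X[:, None, :] - curve[None, :, :]) ** 2).sum(axis=2)
            lam = grid[D.argmin(axis=1)]       # update lambda_i by projection
        return coef, lam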
APPENDIX: PROOFS OF PROPOSITIONS

We make the following assumptions. Denote by X a random vector in R^p with density h and finite second moments. Let f denote a smooth (C^∞) unit-speed curve in R^p parameterized over a closed, possibly infinite interval Λ ⊂ R^1. We assume that f does not intersect itself [λ_1 ≠ λ_2 ⇒ f(λ_1) ≠ f(λ_2)] and has finite length inside any finite ball. Under these conditions, the set {f(λ), λ ∈ Λ} forms a smooth, connected one-dimensional manifold diffeomorphic to the interval Λ. Any smooth, connected one-dimensional manifold is diffeomorphic either to an interval or a circle (Milnor 1965). The results and proofs following could be slightly modified to cover the latter case (closed curves).

Existence of the Projection Index

Existence of the projection index is a consequence of the following two lemmas.

Lemma 5.1. For every x ∈ R^p and for any r > 0, the set Q = {λ : ||x − f(λ)|| ≤ r} is compact.

Proof. Q is closed, because ||x − f(λ)|| is a continuous function of λ. It remains to show that Q is bounded. Suppose that it were not. Then there would exist an unbounded monotone sequence λ_1, λ_2, . . . with ||x − f(λ_i)|| ≤ r. Let B denote the ball around x with radius 2r. Consider the segment of the curve between f(λ_i) and f(λ_{i+1}). The segment either leaves and reenters B, or it stays entirely inside. This means that it contributes at least min(2r, |λ_{i+1} − λ_i|) to the length of the curve inside B. As there are infinitely many such segments, and the sequence {λ_i} is unbounded, f would have infinite length in B, which is a contradiction.

Lemma 5.2. For every x ∈ R^p, there exists λ ∈ Λ for which ||x − f(λ)|| = inf_{μ∈Λ} ||x − f(μ)||.

Proof. Define r = inf_{μ∈Λ} ||x − f(μ)||. Set B = {μ : ||x − f(μ)|| ≤ 2r}. Obviously, inf_{μ∈Λ} ||x − f(μ)|| = inf_{μ∈B} ||x − f(μ)||. Since B is nonempty and compact (Lemma 5.1), the infimum on the right side is attained.

Define d(x, f) = inf_{μ∈Λ} ||x − f(μ)||.

Proposition 5. The projection index λ_f(x) = sup{λ : ||x − f(λ)|| = d(x, f)} is well defined.

Proof. The set {λ : ||x − f(λ)|| = d(x, f)} is nonempty (Lemma 5.2) and compact (Lemma 5.1), and therefore has a largest element.

It is not hard to show that λ_f(x) is measurable; a proof is available on request.

Stationarity of the Distance Function

We first establish some simple facts that are of interest in themselves.

Lemma 6.1. If f(λ_0) is a closest point to x and λ_0 ∈ Λ⁰, the interior of the parameter interval, then x is in the normal hyperplane to f at f(λ_0): ⟨x − f(λ_0), f'(λ_0)⟩ = 0.

Proof. d||x − f(λ)||²/dλ = −2⟨x − f(λ), f'(λ)⟩. If f(λ_0) is a closest point and the derivative is defined (λ_0 ∈ Λ⁰), then it has to vanish.

Definition. A point x ∈ R^p is called an ambiguity point for a curve f if it has more than one closest point on the curve: card{λ : ||x − f(λ)|| = d(x, f)} > 1.

Let A denote the set of ambiguity points. Our next goal is to show that A is measurable and has measure 0.

Define M_λ, the orthogonal hyperplane to f at λ, by M_λ = {x : ⟨x − f(λ), f'(λ)⟩ = 0}. Now, we know that if f(λ) is a closest point to x on the curve and λ ∈ Λ⁰, then x ∈ M_λ.
It is useful to define a mapping χ that maps Λ × R^{p−1} into R^p. Choose p − 1 smooth vector fields n_1(λ), . . . , n_{p−1}(λ) such that for every λ the vectors f'(λ) and n_1(λ), . . . , n_{p−1}(λ) are orthonormal. It is well known that such vector fields do exist. Define χ: Λ × R^{p−1} → R^p by

$$\chi(\lambda, \mathbf{v}) = \mathbf{f}(\lambda) + \sum_{i=1}^{p-1} v_i \mathbf{n}_i(\lambda),$$

and set M = χ(Λ, R^{p−1}), the set of all points in R^p lying in some hyperplane for some point on the curve. The mapping χ is smooth, because f and n_1, . . . , n_{p−1} are assumed to be smooth.

We now present a few observations that simplify showing that A has measure 0.

Lemma 6.2. μ(A ∩ M^c) = 0.

Proof. Suppose that x ∈ A ∩ M^c. According to Lemma 6.1, this is only possible if Λ is a finite closed interval [λ_min, λ_max] and x is equidistant from the endpoints f(λ_min) and f(λ_max). The set of all such points forms a hyperplane that has measure 0. Therefore A ∩ M^c, as a subset of this measure-0 set, is measurable and has measure 0.

Lemma 6.3. Let E be a measure-0 set. It is sufficient to show that for every x ∈ R^p \ E there exists an open neighborhood N(x) with μ(A ∩ N(x)) = 0.

Proof. The open covering {N(x) : x ∈ R^p \ E} of R^p \ E contains a countable covering {N_i}, because the topology of R^p has a countable base.

Lemma 6.4. We can restrict ourselves to the case of compact Λ.

Proof. Set Λ_n = Λ ∩ [−n, n], f_n = f|_{Λ_n}, and A_n the set of ambiguity points of f_n. Suppose that x is an ambiguity point of f; then {λ : ||x − f(λ)|| = d(x, f)} is compact (Lemma 5.1). Therefore, x ∈ A_n for some n, and A ⊂ ∪_1^∞ A_n.

We are now ready to prove Proposition 6.

Proposition 6. The set of ambiguity points has measure 0.

Proof. We can restrict ourselves to the case of compact Λ (Lemma 6.4). As μ(A ∩ M^c) = 0 (Lemma 6.2), it is sufficient to show that for every x ∈ M, with the possible exception of a set C of measure 0, there exists a neighborhood N(x) with μ(A ∩ N(x)) = 0.
In the following, let 𝒢_B denote the class of smooth curves g parameterized over Λ, with ‖g(λ)‖ ≤ 1 and ‖g′(λ)‖ ≤ 1. For g ∈ 𝒢_B, define f_t(λ) = f(λ) + t g(λ). It is easy to see that f_t has finite length inside any finite ball and that, for t < 1, λ_{f_t} is well defined. Moreover, we have the following lemma.

Lemma 4.1. If x is not an ambiguity point for f, then lim_{t↓0} λ_{f_t}(x) = λ_f(x).

Proof. We have to show that for every ε > 0 there exists δ > 0 such that for all t < δ, |λ_{f_t}(x) − λ_f(x)| < ε. Set C = Λ ∩ (λ_f(x) − ε, λ_f(x) + ε)^c and d_C = inf_{λ∈C} ‖x − f(λ)‖. The infimum is attained, and d_C > ‖x − f(λ_f(x))‖ because x is not an ambiguity point. Set δ = (d_C − ‖x − f(λ_f(x))‖)/3. Now λ_{f_t}(x) ∈ (λ_f(x) − ε, λ_f(x) + ε) for all t < δ, because

    inf_{λ∈C} ‖x − f_t(λ)‖ − ‖x − f_t(λ_f(x))‖ ≥ d_C − δ − ‖x − f(λ_f(x))‖ − δ = δ > 0.

Proof of Proposition 4. The curve f is a principal curve of h iff

    (d/dt) D²(h, f_t) |_{t=0} = 0   for all g ∈ 𝒢_B.

We use the dominated convergence theorem to show that we can interchange the orders of integration and differentiation in the expression

    (d/dt) D²(h, f_t) = (d/dt) E_h ‖X − f_t(λ_{f_t}(X))‖².   (A.1)

We need to find a pair of integrable random variables that almost surely bound

    Z_t = (‖X − f_t(λ_{f_t}(X))‖² − ‖X − f(λ_f(X))‖²) / t

for all sufficiently small t > 0. Now,

    ‖X − f_t(λ_{f_t}(X))‖² − ‖X − f(λ_f(X))‖² ≤ ‖X − f_t(λ_f(X))‖² − ‖X − f(λ_f(X))‖².

Expanding the first norm on the right side, we get

    ‖X − f_t(λ_f(X))‖² = ‖X − f(λ_f(X))‖² + t² ‖g(λ_f(X))‖² − 2t ⟨X − f(λ_f(X)), g(λ_f(X))⟩,

and thus

    Z_t ≤ −2⟨X − f(λ_f(X)), g(λ_f(X))⟩ + t ‖g(λ_f(X))‖².   (A.2)

Using the Cauchy-Schwarz inequality and the assumption that ‖g‖ ≤ 1, Z_t ≤ 2‖X − f(λ_f(X))‖ + 1 ≤ 2‖X − f(λ_0)‖ + 1 for all t < 1 and arbitrary λ_0. As ‖X‖ was assumed to be integrable, so is ‖X − f(λ_0)‖, and therefore Z_t is bounded above by an integrable random variable. Similarly, we have

    Z_t ≥ (‖X − f_t(λ_{f_t}(X))‖² − ‖X − f(λ_{f_t}(X))‖²) / t.

Expanding the first norm as before, we get

    Z_t ≥ −2⟨X − f(λ_{f_t}(X)), g(λ_{f_t}(X))⟩ ≥ −2‖X − f(λ_{f_t}(X))‖ ≥ −2‖X − f(λ_0)‖,   (A.3)

which is once again integrable. By the dominated convergence theorem, the interchange is justified. From (A.2) and (A.3), and because f and g are continuous functions, we see that the limit lim_{t↓0} Z_t exists whenever λ_{f_t}(X) is continuous in t at t = 0. We have proved this continuity for a.e. x in Lemma 4.1. Moreover, this limit is given by lim_{t↓0} Z_t = −2⟨X − f(λ_f(X)), g(λ_f(X))⟩, by (A.2) and (A.3).

Denoting the distribution of λ_f(X) by h_λ, we get

    (d/dt) D²(h, f_t) |_{t=0} = −2 E_{h_λ}[(E(X | λ_f(X) = λ) − f(λ)) · g(λ)].   (A.4)

If f(λ) is a principal curve of h, then by definition E(X | λ_f(X) = λ) = f(λ) for a.e. λ, and thus

    (d/dt) D²(h, f_t) |_{t=0} = 0   for all g ∈ 𝒢_B.

Conversely, suppose that

    E_{h_λ}[E(X − f(λ) | λ_f(X) = λ) · g(λ)] = 0   (A.5)

for all g ∈ 𝒢_B. Consider each coordinate separately, and reexpress (A.5) as

    E_{h_λ} k(λ) g(λ) = 0   for all g ∈ 𝒢_B.   (A.6)

This implies that k(λ) = 0 a.s.
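The stationarity condition (A.4) can be checked numerically in a simple case (our own construction, not from the paper): take f(λ) = (λ, 0) on Λ = [−1, 1] with noise orthogonal to the segment, so that f is a principal curve, and perturb with g(λ) = (0, cos λ), which satisfies the 𝒢_B bounds. A Monte Carlo finite difference of D²(h, f_t) at t = 0 should then be close to 0; the sample size, grid, and noise level are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
lam = rng.uniform(-1.0, 1.0, n)                            # lambda ~ Uniform[-1, 1]
X = np.column_stack([lam, 0.3 * rng.standard_normal(n)])   # noise orthogonal to f

grid = np.linspace(-1.0, 1.0, 201)                         # discretized parameter values

def D2(t):
    """Monte Carlo estimate of E ||X - f_t(lambda_{f_t}(X))||^2, with
    f_t(lam) = (lam, t*cos(lam)); projection is done over the grid."""
    d2 = (X[:, :1] - grid) ** 2 + (X[:, 1:] - t * np.cos(grid)) ** 2
    return d2.min(axis=1).mean()

eps = 1e-2
# central finite difference of t -> D2(h, f_t) at t = 0; near 0, since f is
# a principal curve and hence a critical point of the expected squared distance
print((D2(eps) - D2(-eps)) / (2 * eps))
```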
Construction of Densities With Known Principal Curves

Let f be parameterized over a compact interval Λ. It is easy to construct densities with a carrier in a tube around f, for which f is a principal curve. Denote by B_r the ball in R^{p−1} with radius r and center at the origin. The construction is based on the following proposition.

Proposition 7. If Λ is compact, there exists r > 0 such that χ | Λ × B_r is a diffeomorphism.

Proof. Suppose that the result were not true. Pick a sequence r_i ↓ 0. There would exist sequences (λ_i, v_i) ≠ (ξ_i, w_i), with ‖v_i‖ < r_i, ‖w_i‖ < r_i, and χ(λ_i, v_i) = χ(ξ_i, w_i). The sequences λ_i and ξ_i have cluster points λ_0 and ξ_0. We must have λ_0 = ξ_0, because

    f(ξ_0) = χ(ξ_0, 0) = lim χ(ξ_i, w_i) = lim χ(λ_i, v_i) = χ(λ_0, 0) = f(λ_0),

and by assumption f does not intersect itself. So there would be sequences (λ_i, v_i) and (ξ_i, w_i) converging to (λ_0, 0), with χ(λ_i, v_i) = χ(ξ_i, w_i). Nevertheless, it is easy to see that (λ_0, 0) is a regular point of χ, and thus χ maps a neighborhood of (λ_0, 0) diffeomorphically into a neighborhood of f(λ_0), which is a contradiction.

Define T(f, r) = χ(Λ × B_r). Proposition 7 assures that there are no ambiguity points in T(f, r) and that λ_f(x) = λ for x ∈ χ(λ, B_r). Pick a density ζ(λ) on Λ and a density ψ(v) on B_r with ∫_{B_r} v ψ(v) dv = 0. The mapping χ carries the product density ζ(λ) · ψ(v) on Λ × B_r into a density h(x) on T(f, r). It is easy to verify that f is a principal curve for h.
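To make this construction concrete in the plane (f, ζ, ψ, and all constants below are our own illustrative choices): for a half circle the normal "hyperplanes" are radial lines, so χ(λ, v) = (1 + v) f(λ), and sampling λ from a uniform ζ and v from a symmetric ψ on (−r, r) produces a sample from a density for which, by the result above, f is a principal curve.

```python
import numpy as np

rng = np.random.default_rng(1)
n, r = 5000, 0.5                      # tube radius; any 0 < r < 1 works here

# f: the half circle lam -> (cos(lam), sin(lam)), compact Lambda = [0, pi]
lam = rng.uniform(0.0, np.pi, n)      # zeta: uniform density on Lambda
v = rng.uniform(-r, r, n)            # psi: symmetric on B_r, so its mean is 0

# the unit normal at f(lam) is radial, hence chi(lam, v) = (1 + v) * f(lam)
X = (1.0 + v)[:, None] * np.column_stack([np.cos(lam), np.sin(lam)])

# along each normal ray the conditional mean of v is 0, so E(X | lam) = f(lam):
# the average radius of the sample is close to 1
print(abs(np.linalg.norm(X, axis=1).mean() - 1.0) < 0.05)   # True
```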
[Received December 1984. Revised December 1988.]

REFERENCES

Anderson, T. W. (1982), "Estimating Linear Structural Relationships," Technical Report 389, Stanford University, Institute for Mathematical Studies in the Social Sciences.
Becker, R., Chambers, J., and Wilks, A. (1988), The New S Language, New York: Wadsworth.
Carroll, D. J. (1969), "Polynomial Factor Analysis," in Proceedings of the 77th Annual Convention, Arlington, VA: American Psychological Association, pp. 103-104.
Cleveland, W. S. (1979), "Robust Locally Weighted Regression and Smoothing Scatterplots," Journal of the American Statistical Association, 74, 829-836.
Efron, B. (1981), "Non-parametric Standard Errors and Confidence Intervals," Canadian Journal of Statistics, 9, 139-172.
Efron, B. (1982), The Jackknife, the Bootstrap, and Other Resampling Plans (CBMS-NSF Regional Conference Series in Applied Mathematics, No. 38), Philadelphia: Society for Industrial and Applied Mathematics.
Etezadi-Amoli, J., and McDonald, R. P. (1983), "A Second Generation Nonlinear Factor Analysis," Psychometrika, 48, 315-342.
Golub, G. H., and Van Loan, C. (1979), "Total Least Squares," in Smoothing Techniques for Curve Estimation, Heidelberg: Springer-Verlag, pp. 69-76.
Hart, J., and Wehrly, T. (1986), "Kernel Regression Estimation Using Repeated Measurement Data," Journal of the American Statistical Association, 81, 1080-1088.
Hastie, T. J. (1984), "Principal Curves and Surfaces," Laboratory for Computational Statistics Technical Report 11, Stanford University, Dept. of Statistics.
Milnor, J. W. (1965), Topology From the Differentiable Viewpoint, Charlottesville: University of Virginia Press.
Shepard, R. N., and Carroll, D. J. (1966), "Parametric Representations of Non-linear Data Structures," in Multivariate Analysis, ed. P. R. Krishnaiah, New York: Academic Press, pp. 561-592.
Silverman, B. W. (1985), "Some Aspects of Spline Smoothing Approaches to Non-parametric Regression Curve Fitting," Journal of the Royal Statistical Society, Ser. B, 47, 1-52.
Stone, M. (1974), "Cross-validatory Choice and Assessment of Statistical Predictions" (with discussion), Journal of the Royal Statistical Society, Ser. B, 36, 111-147.
Thorpe, J. A. (1979), Elementary Topics in Differential Geometry, New York: Springer-Verlag.
Wahba, G., and Wold, S. (1975), "A Completely Automatic French Curve: Fitting Spline Functions by Cross-Validation," Communications in Statistics, 4, 1-17.
Watson, G. S. (1964), "Smooth Regression Analysis," Sankhya, Ser. A, 26, 359-372.